Introduction to Computer Systems
[footnote] Some processors use the name program
status word (PSW) for the status register instead.



Overview of the History of the Digital Computer 


Early development of computers centered on
mechanical calculators built to perform basic
arithmetic operations, from the Sumerian abacus,
designed around 2500 BC, to the mechanical
calculators invented in the Renaissance. The
significance of the mechanical calculator is
twofold. First, in trying to develop more powerful
and more flexible calculators, Charles Babbage
became the first to theorize computers. Second, the
design of the mechanical calculator led to Intel’s
development of a low-cost microprocessor, an
electronic version of the mechanical calculator,
which became the first commercially available
microprocessor. Digital computers are made of
electronic components. Vacuum tubes were the first
components used to build computers. However, their
huge size and power consumption pushed computer
engineers to seek a more compact device that
consumes less power. Therefore, after the first
successes in building computers, transistors were
used to build computers, followed by large scale
integrated circuits (LSI) and very large scale
integrated circuits (VLSI).


Vacuum Tubes 


Modern computers started with the development of
ENIAC (Electronic Numerical Integrator and
Computer), which used vacuum tubes and was built by
John Mauchly and J. Presper Eckert of the
University of Pennsylvania in 1946. ENIAC was
originally designed to compute artillery firing
tables for the United States Army. It was able to
perform arithmetic operations on ten-digit decimal
numbers. It had a modular design with a branch
capability based on the sign of a computation
result.


ENIAC was built from vacuum tubes, was huge, and
consumed a lot of power. It was composed of 17,468
vacuum tubes, 7,200 crystal diodes, 1,500 relays,
70,000 resistors, 10,000 capacitors, and around 5
million hand-soldered joints. It took up 1,800
square feet and consumed 150 kW of power. Input
and output were handled by an IBM card reader and
an IBM card punch.


ENIAC could perform 5,000 simple addition or
subtraction operations per second, about 385
multiplication operations per second, 40 division
operations per second, and 3 square root operations
per second. Programs could include sequential
arithmetic operations, loops, and subroutines.
However, programming took weeks. First, the
problem in question was mapped to a program on
paper. Second, the program was transferred into
ENIAC via its switches and cables. Finally, the
program was debugged by single-step execution on
ENIAC. The inability to store programs in the
machine led to the development of the stored
program machine described by John von Neumann.


Stored Program Computer 


Before ENIAC was fully functional, John Mauchly
and J. Presper Eckert had started working on the
Electronic Discrete Variable Automatic Computer
(EDVAC) project. EDVAC was binary rather than
decimal, and was a stored program computer. John
von Neumann participated in that project as a
consultant. In 1945, von Neumann summarized and
elaborated the logic design of EDVAC in his
article entitled “First Draft of a Report on the
EDVAC”. The design it describes became known as
the von Neumann architecture (a.k.a. the von
Neumann model). In this architecture, a computer
system is composed of a processing unit, which
contains an arithmetic logic unit (ALU) and
registers; a control unit, which contains an
instruction register and a program counter; a
memory that stores both instructions and data;
external mass storage; and I/O mechanisms. Because
instructions and data share the same memory, an
instruction fetch and a data fetch cannot occur at
the same time, which limits performance. The
stored-program architecture known as the Harvard
architecture improves performance by dedicating
one bus to instructions and another to data.


Another branch of stored program architecture
research was conducted by Soviet scientists Sergei
Sobolev and Nikolay Brusentsov on ternary
computers, which operated on a base-three
numerical system of -1, 0, and 1 rather than the
binary numerical system on which most contemporary
computers are based. This computer, however, was
later switched to the binary numerical system.


Transistor 


Vacuum tubes were used to build computers in the
1950s. With advances in electronic components,
transistors replaced vacuum tubes in electronic
devices, including computers, in the 1960s.
Transistors are cheaper, faster, and more reliable
than vacuum tubes, and they consume less power.
The first transistor computer was demonstrated by
the University of Manchester in 1953. That machine
was composed of 550 diodes and 92 point-contact
transistors, the first type of solid-state
electronic transistor, which was constructed by
researchers John Bardeen and Walter Houser
Brattain at Bell Laboratories. Each word in the
machine was 48 bits long. Its clock generator was
constructed of a few vacuum tubes.


Integrated Circuit 


In the 1970s, integrated circuit (IC) technology
further reduced the size and cost of computers
while increasing computation speed.
Microcontrollers were widely used in appliances
such as video recorders and laundry machines.
Robert Noyce invented the silicon integrated
circuit in 1958. Semiconductor devices were
observed to perform functions similar to those of
vacuum tubes. Advances in device fabrication allow
a large number of transistors to be packed into a
small chip, so discrete electronic components no
longer have to be assembled manually. The mass
production of ICs results in reliable,
modularized, low-cost products, in which blocks of
transistors in a circuit are replaced with
corresponding ICs. Their low cost comes from
printing a chip and all its components as a unit
by photolithography rather than producing one
transistor at a time. An IC also uses less
material, measured as area per transistor. The
performance of an IC is better because it consumes
less power and, with its components closer
together, switches quickly. In this era, IBM
manufactured two highly successful machines, the
7094 and the 1401, which were later replaced by
the IBM System/360 Model 75 and Model 30,
respectively. The Intel 4004, a member of the
MCS-4 family, was a 4-bit CPU released in 1971. It
was the first commercially available complete CPU
on a chip. The 4004 CPU is built of approximately
2,300 transistors and runs at up to 740 kHz. It
has 46 instructions (8 or 16 bits wide), sixteen
4-bit registers, a 3-level-deep subroutine stack,
and 12-bit addresses.


Very Large Scale Integrated Circuit 


In the 1980s, further advances in integrated
circuits and packaging allowed millions of
transistors to be put on a single chip, known as a
very large scale integrated circuit (VLSI). This
technology made it feasible to build fairly
low-cost and small personal computers. Two major
early commercially available personal computers
were the Intel 8080 computer and the Apple II. The
Intel 8080 computer came with parts including the
8080 CPU, cables, a power supply, and an 8-inch
floppy disk, but without any software. The CP/M
operating system, written by Gary Kildall, ran
widely on 8080s. The Apple II was designed by
Steve Jobs and Steve Wozniak in their garage. The
machine came with a built-in BASIC interpreter,
which made programming easier and fun. The machine
was popular with home users and students at
schools.


Table 1 Milestones in Computer Development


Technology | Year | Name | Made by | Remark
Vacuum Tubes | 1946 | ENIAC | Eckert/Mauchly | Modern compute…
Vacuum Tubes | 1949 | EDVAC | Eckert/Mauchly | Stored program…
Vacuum Tubes | 1949 | EDSAC | Wilkes | Stored program c…
Vacuum Tubes | 1951 | Whirlwind I | MIT | First real-time co…
Vacuum Tubes | 1952 | IAS | von Neumann | Most computers…
Transistors | 1960 | PDP-1 | DEC | First minicomput…
Transistors | 1961 | 1401 | IBM | Popular machine… business
Transistors | 1962 | 7094 | IBM | Dominated scien… in 1960s
Transistors | 1963 | B5000 | Burroughs | First high-level…
Transistors | 1964 | 360 | IBM | First product line… family
Transistors | 1964 | 6600 | CDC | First scientific su…
Transistors | 1965 | PDP-8 | DEC | First mass-marke…
ICs | 1970 | PDP-11 | DEC | Dominated mini… 1970s
ICs | 1971 | 4004 | Intel | First general 4-bi…
ICs | 1974 | 8080 | Intel | First general 8-bi…
ICs | 1974 | Cray-I | Cray | First vector supe…
ICs | 1978 | VAX | DEC | First 32-bit super…
VLSI | 1981 | IBM-PC | IBM | Personal comput…
VLSI | 1985 | MIPS | MIPS | First commercial…
VLSI | 1987 | SPARC | Sun | … workstation
VLSI | 1990 | RS6000 | IBM | First superscalar…


Introduction to Instruction Set Architecture,
Microarchitecture and System Architecture


Instruction Set Architecture 


An instruction set architecture (ISA) details the
programming model and computing abstraction of a
processor. Assembly programmers and compiler
writers create programs by consulting an ISA. The
instruction set of a CPU dictates what the CPU can
do and what it can do efficiently. An instruction
set has to be developed at the first design stage
of a CPU. The instruction set architecture is one
of the most important design dimensions, and a CPU
designer must get it right from the outset. A good
instruction set architecture benefits other
features such as code size, instruction
encoding/decoding, pipelining, caches, superscalar
execution, and the like. A typical instruction
contains information such as an op-code, operands,
addressing modes, and immediate values. The
op-code instructs the CPU to perform an operation
such as addition or logical AND. Operands indicate
the values to be computed upon, including source
operands and a target operand, where the
computation result will be stored. Addressing
modes specify where to obtain the operands.
Normally, operands come from registers or memory.
Immediate values are operands that come along with
the instruction stream. Small constants, for
example, are often encoded in the instruction to
eliminate extra operand fetches. Table 2 depicts a
typical instruction. Note that the Src2 and Dst
operands may be combined into one field. In that
case, the computation becomes

Dst = Src1 + Dst

instead of

Dst = Src1 + Src2

Moreover, the immediate value may span several
fields to accommodate a larger value.


Table 2 A Typical Instruction 


Op-code | Src1 | Src2 | Dst | Addressing | Immediate


It is nearly impossible to implement everything in
a CPU because of the space limit. Each instruction
takes some silicon real estate in terms of the
number of transistors. The Intel 4004 CPU uses
about 2,300 transistors whereas the Pentium 4 uses
42,000,000 transistors. Although the transistor
budget for multi-core processors is huge (the
10-core Xeon Westmere-EX, for example, employs
2,600,000,000 transistors), each added feature
demands quite a few transistors. Thus, only
instructions that are a must are added to the
instruction set. The subtraction instruction is a
good example. Since subtraction can be translated
to addition with two’s complement, it is not
necessary to include a subtraction instruction.
Instead, a two’s complement unit may be
implemented along with an addition instruction to
achieve subtraction. For example,

x - y = x + y’

where y’ is the two’s complement of y.
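The identity above can be tried out in a few lines
of Python. This is only a minimal sketch: the
function names and the 8-bit word size are
assumptions made for illustration, not part of any
particular processor.

```python
def twos_complement(y: int, bits: int = 8) -> int:
    """Return the two's complement of y within a word of `bits` bits."""
    mask = (1 << bits) - 1
    return ((~y) + 1) & mask      # invert every bit, add one, keep `bits` bits


def subtract_via_add(x: int, y: int, bits: int = 8) -> int:
    """Compute x - y using only addition and a two's complement step."""
    mask = (1 << bits) - 1
    return (x + twos_complement(y, bits)) & mask   # the carry out is discarded


difference = subtract_via_add(9, 5)   # 9 + 251 = 260, which truncates to 4
negative = subtract_via_add(5, 9)     # 252, the 8-bit pattern for -4
```

The second call shows that a negative result comes
out as its own two’s complement bit pattern, so no
separate subtraction circuit is needed.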


The rule of thumb in hardware design is the less,
the better: the fewer resources used, the greater
the benefits. The benefits include 1) lower cost,
2) a simpler circuit, 3) lower power consumption,
and 4) easier verification. A complicated
instruction set would demand a lot of space, and
the cost would be high. In implementing a complex
instruction set, neither the circuit nor the
verification of the correctness of the CPU would
be simple. A huge number of transistors also means
that power consumption will be fairly high.


One of the difficult design concerns for an
instruction set architecture is deciding which
instructions have to be included. It is hard to
predict which instructions popular applications
will need ten years down the road. Nevertheless,
most past CPUs existed for only a couple of years.
For example, Intel’s MMX extension to its Pentium
CPUs reflects the need for multimedia
applications. Nobody could have known, at the time
of Intel’s 8080 processors, that MMX should be
included.


Adding an instruction to a CPU is easier than
taking one out for a number of reasons. First,
there is the issue of backward compatibility. A
lot of applications are based on the existing
instruction set. Taking instructions out of the
instruction set means breaking their execution. On
the other hand, adding new instructions does not
affect their execution because they use only the
old set of instructions. This is called backward
compatibility. Backward compatibility is crucial
in the computer industry. Even in Windows 7, there
is still a DOS window that may execute a DOS
program developed in the ’80s.


An instruction set should be designed so that most
assembler or compiler writers can develop system
programs easily. The system programs will help
application programmers develop programs that run
on the processor. The popularity of a processor
depends on its simplicity. Simple wins all. A
complex system would never be popular.


Microarchitecture 


The way a given instruction set architecture (ISA)
is implemented on a processor is called the
microarchitecture, also called computer
organization. Computer architecture is the
combination of the ISA and the microarchitecture.
A microarchitecture is typically represented by a
diagram that depicts the interconnections among
microarchitectural components, which include
gates, function units, registers, the arithmetic
logic unit (ALU), and the like. The diagram
includes the data path, where data flows through
components, and the control path, which
coordinates the data flows. A schematic diagram is
used to describe a microarchitecture element at
the gate level. Each gate is represented by
transistors and their interconnections for a
specific logic device. A given ISA may be
implemented using different microarchitectures.
For example, the x86 ISA has been implemented in
processors manufactured by Intel, AMD, and VIA.
The latest VLSI technology may improve the
performance of processors using an old ISA.


System Architecture 


A system architecture is a formal abstraction
model that describes the system components,
structure, behavior, and relations among
components. In a computer system that includes
hardware and software, a system architect is
concerned with the complete artifact, both
hardware and software, and all the interfaces of
the artifact, including those among software
components, among hardware components, between
hardware and software, and between the artifact
and its users. A good system architecture design
will include the requirements of the overall
system. The requirements are mapped to a set of
system components (hardware or software), which
may be validated against the requirements via a
series of system tests. To some extent, the system
architecture specifies what has to be done and
what to expect, rather than how to get things
done. Hardware engineers and software engineers,
on the other hand, design hardware and software
components according to the high-level system
architecture. The instruction set architecture and
microarchitecture are then developed. Software
engineers develop each software component running
on the hardware according to the system
architecture. Each system component is verified
and tested by its designers. After the system
components are integrated, the overall system
undergoes a series of thorough system tests to
guarantee that the system requirements are met.


Processor Architecture — Instruction Types, 
Register Sets, Addressing Modes 


A typical processor is composed of registers, an
ALU, an instruction decoding unit, and control
units. There are special registers that keep the
CPU working, such as the program counter (PC),
stack pointer (SP), and status register (SR)
[footnote]. The PC holds the address of the next
instruction to be executed. The SP points to the
top of the stack in memory. The SR is a register
that keeps the processor state after each
instruction has been executed. An instruction
register is used to keep the instruction currently
under execution. The instruction register is
connected to the instruction decoding unit, whose
outputs are control signals that synchronize all
the components inside the CPU.


Instruction Types 


There are various classes of instructions: data
movement, arithmetic, bit-wise, and flow control.
Data movement instructions transfer data from
registers to registers, from registers to memory,
from memory to registers, and from memory to
memory (for some CPUs). Data movement prepares
operands for further computation, or simply stores
data in memory when a computation is completed.


Arithmetic instructions include arithmetic
operations such as addition, subtraction,
multiplication, division, and the like. Some
processors provide complicated operations such as
multiplication if enough space is available. In
processors that support only basic arithmetic
operations, a complex instruction such as
multiplication has to be emulated by either
subroutines or macros, meaning that a software
approach is needed to achieve the complex
operation. This is a tradeoff between performance
and cost (space). With complex instructions in
hardware, the performance is high but a
significant amount of hardware resources is
required.


Bit-wise operations include AND, OR, NOT, XOR, and
the like. Bit-wise operations work on each
individual bit of data, and sometimes may not lead
to the same result as their corresponding logical
operations. For example,

0x1 AND 0x2 = 0

because

0001 AND 0010 = 0000

If logical false is defined as 0, and any non-zero
value is considered logical true, the logical AND
of 0x1 and 0x2 should be true because

0x1 AND 0x2 = true AND true = true

Practically, we may still use bit-wise operations
for their corresponding logical operations if
logical true is limited to 0x1. In that case, the
logical operations yield the same results as the
bit-wise operations. Some special operations, such
as “remainder of dividing by 8”, may be
accomplished by bit-wise operations. For example,
the remainder of a non-negative number x divided
by 16 can be computed by the following statement:

x AND 0xF

The AND here is a bit-wise operation, and the
result of the above statement is the lower 4 bits
of the number x, which is exactly the remainder!
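Both points can be checked directly. The snippet
below is a small sketch in Python; the helper name
remainder_pow2 is hypothetical and exists only for
this illustration.

```python
x, y = 0x1, 0x2

bitwise_result = x & y                 # 0001 AND 0010 = 0000
logical_result = bool(x) and bool(y)   # both operands are non-zero, hence true


def remainder_pow2(value: int, divisor: int) -> int:
    """Remainder of a non-negative value divided by a power-of-two divisor."""
    if divisor <= 0 or divisor & (divisor - 1) != 0:
        raise ValueError("divisor must be a power of two")
    return value & (divisor - 1)       # for 16 the mask is 0xF


low_four_bits = remainder_pow2(45, 16)   # same value as 45 % 16
```

The mask trick works only because the divisor is a
power of two, so that divisor - 1 has all of its
low bits set.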


Flow control instructions are composed of jumps,
predicates, and subroutine calls, which are
required to implement program structures such as
if-then-else, case-switch, loops, subroutines,
etc. Jump instructions include unconditional jumps
and conditional jumps. Unconditional jumps
implement go-to statements in a loop or a
conditional if structure. Conditional jumps
involve a predicate instruction, which evaluates a
Boolean expression or a relational expression to
either true or false. Therefore, before a
conditional jump instruction, there is always a
predicate instruction such as a comparison of two
numbers. The result, either true or false, is then
used by the conditional jump. The end result of a
jump instruction is a change of the program
counter (PC). Other instructions that modify the
PC include subroutine calls. When a subroutine
call is made, the PC is loaded with the beginning
address of the subroutine. Program control is then
transferred to the subroutine. When the subroutine
finishes its execution, the PC is loaded with the
address of the instruction that follows the
subroutine call.


Addressing Modes 


The addressing modes of a processor dictate where
to load the operands for an instruction. The
operands may come from registers, memory, or the
instruction itself. Typical addressing modes
include register direct, direct, register
indirect, index, and immediate. Not all processors
implement all addressing modes because, again,
there is a tradeoff between cost and performance.
The more addressing modes supported, the greater
the space and cost. However, almost all processors
support the register direct addressing mode. In
register direct addressing, operands are stored in
registers. Direct addressing allows the address of
an operand to be specified in the instruction.
Register indirect addressing stores the address of
an operand in a register. It is typically used to
specify a memory operand. Index addressing is also
used to specify a memory operand, especially an
array element, with the form

base(R)

where base is the starting address and R is a
register that keeps the index. The effective
address is calculated by

effective address = base + R

Since the array elements are all of the same size,
accessing another array element is simply a matter
of changing the index register R. There are other
special addressing modes, e.g., symbolic
addressing and absolute addressing, to be
discussed in a later section.
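The five typical addressing modes can be mimicked
with a toy machine model. The sketch below is only
an illustration: the memory list, the register
names, and the helper functions are all
assumptions, and real processors resolve these
modes in hardware.

```python
memory = [0] * 32                    # toy memory, one value per address
registers = {"R4": 0, "R5": 0}       # toy register file

memory[10:14] = [11, 22, 33, 44]     # a four-element array starting at address 10


def register_direct(reg: str) -> int:
    return registers[reg]                   # the operand is in the register


def direct(addr: int) -> int:
    return memory[addr]                     # the address is in the instruction


def register_indirect(reg: str) -> int:
    return memory[registers[reg]]           # the register holds the address


def indexed(base: int, reg: str) -> int:
    return memory[base + registers[reg]]    # effective address = base + R


def immediate(value: int) -> int:
    return value                            # the operand is in the instruction


# Walk the array by changing only the index register R5.
elements = []
for index in range(4):
    registers["R5"] = index
    elements.append(indexed(10, "R5"))
```

Only the index register changes between accesses,
which is why index addressing suits arrays whose
elements all have the same size.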


Processor Structures —- Memory-to-Register and 
Load/Store Architectures 


The term “load” refers to an operation that moves
data from memory to registers, whereas the term
“store” refers to the opposite operation, i.e.,
storing data from registers in memory. In most
reduced instruction set computer (RISC)
architectures, such as MIPS, there is an abundance
of registers, and memory accesses are 100-400
times slower than register accesses. Thus, the
intent is to keep operands in registers for
operations, and to allow only the load and store
instructions to interact with memory. This type of
architecture is normally called a load/store
architecture.


Data movement in a memory-to-register
architecture is determined by addressing modes. In
this architecture, every instruction (not just the
load and store instructions, as in the load/store
architecture) could potentially move data around.
For example, in MSP430, the instruction
ADD 0(R4), R5 will sum R5 and the content at the
memory address 0 + R4, and put the result in the
register R5. This instruction implicitly moves
data from memory to a register. On the other hand,
the instruction ADD 0(R4), 0(R5) will move data
from memory to registers, operate upon them, and
move the result back to memory. The two source
operands are in memory at locations 0 + R4 and
0 + R5. The result is stored in memory at location
0 + R5.

The difference between memory-to-register and
load/store architectures is that only load and
store instructions may access memory in the
load/store architecture, but all instructions with
suitable addressing modes are allowed to access
memory in the memory-to-register architecture. The
memory-to-register architecture is typically
implemented in complex instruction set computers
(CISC), which provide flexibility at the price of
a more complex processor design.
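The contrast can be sketched with a toy model in
Python. The function names, the scratch registers
R6 and R7, and the LOAD/STORE mnemonics in the
comments are assumptions for illustration; they
are not the instruction set of any real machine.

```python
def add_memory_to_register(mem, regs):
    """ADD 0(R4), 0(R5): a single instruction reads and writes memory."""
    mem[0 + regs["R5"]] = mem[0 + regs["R4"]] + mem[0 + regs["R5"]]
    return 1    # number of instructions executed


def add_load_store(mem, regs):
    """Same effect when only loads and stores may touch memory."""
    regs["R6"] = mem[0 + regs["R4"]]        # LOAD  0(R4) into R6
    regs["R7"] = mem[0 + regs["R5"]]        # LOAD  0(R5) into R7
    regs["R7"] = regs["R6"] + regs["R7"]    # ADD   R6, R7 (registers only)
    mem[0 + regs["R5"]] = regs["R7"]        # STORE R7 into 0(R5)
    return 4    # number of instructions executed


memory_a = [0, 0, 5, 7]
memory_b = [0, 0, 5, 7]
count_a = add_memory_to_register(memory_a, {"R4": 2, "R5": 3})
count_b = add_load_store(memory_b, {"R4": 2, "R5": 3, "R6": 0, "R7": 0})
```

Both versions leave the same value in memory. The
load/store version needs more instructions, but
each of them is simpler.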


Instruction Sequencing, Flow-of-Control, 
Subroutine Call and Return Mechanisms 


Instructions in a program are executed
sequentially on a single processor system. This
instruction sequencing is broken when the need
arises, such as when a condition changes or an
event occurs. For example, consider a program that
counts how many A’s and how many non-A’s appear in
a grading sheet. The program logic depends on the
grade input. If the grade input is “A”, a piece of
code that increases a counter for A’s by one is
executed. Otherwise, another piece of code that
increases a counter for non-A’s by one is
executed. This program flow is controlled by the
data input.
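The grade-counting example can be written out as a
short Python sketch. The function name and the
sample grades are made up for illustration.

```python
def count_grades(grades):
    """Count A's and non-A's; the branch taken depends on each input."""
    a_count = 0
    non_a_count = 0
    for grade in grades:
        if grade == "A":
            a_count += 1        # executed only when the input is an A
        else:
            non_a_count += 1    # executed for every other grade
    return a_count, non_a_count


counts = count_grades(["A", "B", "A", "C", "A"])
```

The same code produces a different sequence of
executed instructions for every grading sheet,
which is what data-controlled program flow means.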


Conditional operations such as if-then-else and
loop structures are implemented and supported at
the machine level. There are also instructions to
support subroutine calls and returns as well as
interrupt service routines. These control
operations have corresponding supporting
instructions at the machine level such as JUMP,
CALL, RET, and RETI, where RET is used for the
return from a normal subroutine call, and RETI is
used for the return from an interrupt service
routine.


Structure of Machine-Level Programs 


Most machine-level programs, such as assembly
programs, are very organized. They are written in
mnemonic symbols, each of which has a
corresponding machine instruction. For example,
ADD is used for addition, SUB is used for
subtraction, etc. A label is used to represent a
memory location. Most assembly programs follow a
four-column format, as illustrated in Table 3.


Table 3 The Four-Column Format in Assembly 
Programming 


1st column | 2nd column | 3rd column | 4th column
Label | Op-code | Operands | ; Comments


The first column specifies a label, which
designates a particular program memory address.
Normally, this address is the target of a jump or
the beginning of a loop or a subroutine. The
second column specifies an operation via its
mnemonic symbol, which is typically called the
op-code. The third column declares operands. There
may be several operands, depending on the
processor design. The order of the operands
determines which of them are sources and which is
the destination. The fourth column holds the
comments, normally preceded by a semicolon. The
comments describe what the instruction on that
line is all about. It is highly recommended to
document in as much detail as possible because
assembly programming is less structured. Comments
will greatly increase readability.


Limitations of Low-Level Architectures 


The instruction set architecture specifies what a
processor can do. A program to be executed on a
processor has to consist of instructions from its
instruction set. Obviously, there are a number of
operations that are not implemented directly in
its ISA. For example, multiplication is not
normally implemented in embedded processors such
as MSP430. For a missing operation to still be
performed on a processor that does not directly
implement it, a software approach such as macros
or subroutines is applied. In the multiplication
example, a shift-add algorithm may be implemented
using only instructions supported by MSP430. The
limits of an architecture, however, vary
significantly. Factors that determine the limits
include the application, cost, and performance of
a processor.
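A shift-add multiplication can be sketched as
follows. This is a Python illustration of the
general technique, not MSP430 code: the function
name and the 16-bit word size are assumptions, and
only additions, shifts, and bit tests are used.

```python
def shift_add_multiply(a: int, b: int, bits: int = 16) -> int:
    """Multiply two unsigned values using only shift, add, and bit tests."""
    mask = (1 << bits) - 1
    product = 0
    multiplicand = a & mask
    multiplier = b & mask
    while multiplier != 0:
        if multiplier & 1:                            # lowest multiplier bit set?
            product = (product + multiplicand) & mask
        multiplicand = (multiplicand << 1) & mask     # next power of two
        multiplier >>= 1                              # move to the next bit
    return product


small = shift_add_multiply(13, 11)      # fits in 16 bits
wrapped = shift_add_multiply(300, 300)  # overflows and keeps the low 16 bits
```

The loop runs once per multiplier bit, which is
why a software multiply costs many more clock
cycles than a hardware multiply instruction.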


Other architectural limitations include addressing
modes, instruction weights, size of registers,
etc. Addressing modes are directly supported by
the hardware. For example, register indirect
addressing is considered a high-performance
mechanism to access an operand in memory in
MSP430. However, it is supported only for the
source operand. The destination operand may not be
specified by register indirect addressing. A
workaround is index addressing, but it is slower
than register indirect addressing.


Not all instructions are created equal. Some
instructions require more clock cycles to finish,
and some require fewer. The number of registers is
fixed in a processor. Some processors have more
registers and some have fewer. Knowing the
limitations of low-level architectures is
important if a system is to be not just correct
but also high performance.


Low-Level Architectural Support for High-Level 
Languages 


Higher-level programs have to be translated to
machine code eventually. Should there be no
support in a particular ISA, the design of a
compiler, along with its performance, will be
affected drastically. Low-level architectural
support, such as for user-defined subroutines and
interrupt service routines, directly affects the
design and the performance of high-level
languages.


High-level program constructs such as
if-then-else, do-loops, functions, and interrupt
service routines are translated to suitable
machine instructions. This translation is normally
done by a compiler. A compiler writer, therefore,
has to figure out what machine instructions may be
used to support those high-level programming
constructs.


The syntax of an if-then-else construct in a
high-level programming language is as follows:


Table 5 Syntax of If-Then-Else Program Construct


if B then
    S1
else
    S2
end if


In Table 5, B is a Boolean expression, and S1 and
S2 are statements. The Boolean expression is
evaluated to either true or false. That condition
is then used to change the program control flow.
This program construct, when translated to machine
code, has the following structure:


Table 6 Machine Code Structure for an If-Then-Else
Construct


        B           ; code for B
        JF False    ; jump if B is false
        S1          ; true statement
        J EndIf     ; jump to EndIf
False:  S2          ; false statement
EndIf:


If B, S1, and S2 can be translated using supported
instructions, the low-level instructions JF and J
still have to be supported. Here JF is a
conditional jump instruction that depends on the
result of the previous instruction. Most
processors keep execution results in a status
register (SR). The JF instruction checks the SR
before changing the control flow. The other
instruction, J, an unconditional jump, is usually
supported as well. A loop has a similar structure.
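The effect of JF and J on the program counter can
be traced with a tiny interpreter. The sketch
below is a hypothetical model in Python: the
instruction tuples, the END marker, and the single
Boolean standing in for the status register are
assumptions made for this illustration.

```python
def run_if_then_else(b: bool):
    """Execute the if-then-else machine code structure and trace statements."""
    program = [
        ("B", None),     # 0: evaluate B, result goes to the status register
        ("JF", 4),       # 1: jump to False (address 4) if B is false
        ("S1", None),    # 2: true statement
        ("J", 5),        # 3: unconditional jump to EndIf (address 5)
        ("S2", None),    # 4: False: false statement
        ("END", None),   # 5: EndIf: stop the toy machine
    ]
    trace = []
    status = False       # stands in for the status register (SR)
    pc = 0               # program counter
    while True:
        op, target = program[pc]
        pc += 1                          # default: next instruction in sequence
        if op == "B":
            status = b
        elif op == "JF" and not status:
            pc = target                  # conditional jump taken
        elif op == "J":
            pc = target                  # unconditional jump always taken
        elif op in ("S1", "S2"):
            trace.append(op)
        elif op == "END":
            return trace


true_path = run_if_then_else(True)
false_path = run_if_then_else(False)
```

Exactly one of S1 and S2 appears in each trace,
and the only instructions that disturb the
sequential order are the two jumps.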


At the assembly language level, how parameters are
passed to subroutines and how local workspace is
created and accessed will definitely affect the
overall performance. A lack of resources has an
impact on high-level languages and the design of
compilers. An argument stack is typically used to
pass actual parameters to a function in most
high-level languages. Access to each of the
parameters relies on stack operations, such as
PUSH and POP. Hardware support for the stack
operations is essential to system performance in
light of the heavy use of subroutines and
functions in high-level programming.


The two most important instructions that hardware
must support in order to execute subroutines or
functions are CALL and RET. The CALL instruction
changes the value of the PC to the starting
address of the subroutine to be called, and pushes
the address of the instruction after the call
instruction itself (i.e., the return address) onto
the stack. Although the programmer may freely
change the PC, the address of the next instruction
may not be known until runtime. Therefore,
hardware support for the CALL instruction is a
must. The RET instruction is the last instruction
in a subroutine or a function. Its purpose is to
unwind the stack and restore the PC with the
stored return address. Depending on the CPU
design, the RET instruction may also perform other
tasks such as restoring the SR.
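The interplay of CALL, RET, the PC, and the stack
can be modeled in a few lines. This is a minimal
sketch in Python: the class TinyCPU, the
one-address instruction size, and the addresses
used are assumptions for illustration only.

```python
class TinyCPU:
    """Toy model that tracks only the program counter and the stack."""

    def __init__(self, pc: int = 0):
        self.pc = pc
        self.stack = []

    def call(self, target: int, instruction_size: int = 1) -> None:
        # Push the address of the instruction after the CALL, then jump.
        self.stack.append(self.pc + instruction_size)
        self.pc = target

    def ret(self) -> None:
        # Pop the stored return address back into the PC.
        self.pc = self.stack.pop()


cpu = TinyCPU(pc=100)
cpu.call(500)              # outer subroutine at address 500
cpu.pc = 510               # pretend a few instructions have executed
cpu.call(800)              # nested subroutine at address 800
depth_inside = len(cpu.stack)
cpu.ret()
after_inner_return = cpu.pc
cpu.ret()
after_outer_return = cpu.pc
```

Because return addresses are pushed and popped in
last-in, first-out order, nested calls return to
the right place without extra bookkeeping.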


Binary Arithmetic 

This chapter gives an overview of binary
arithmetic, including finite precision arithmetic
for addition, subtraction, multiplication, and
division. One’s and two’s complement
representations for negative numbers in digital
systems are introduced. A hardware adder is given
as an example of fundamental digital computing.


Binary Arithmetic 


Binary arithmetic is used in digital systems
mainly because numbers (decimal and floating-point
numbers) are stored in binary format in most
computer systems. All arithmetic operations such
as addition, subtraction, multiplication, and
division are done on the binary representation of
numbers. It is necessary to understand binary
number representation to figure out binary
arithmetic in digital computers.


In most ALU (arithmetic logic unit) hardware, the
numbers operated upon are stored in a fixed number
of bits, typically equivalent to between 6 and 16
decimal digits. Therefore, there is a precision
limit, or precision error, when performing binary
arithmetic on computers. This binary arithmetic is
called fixed-precision arithmetic. It contrasts
with arbitrary-precision arithmetic, such as Java
BigInteger, a technique in which calculations are
performed on numbers whose precision depends only
on the amount of memory available in the system.
In other words, a number could occupy as much
memory space as needed if higher precision is
required. We will focus on fixed-precision
arithmetic from now on.


Since the precision of numbers stored in the
computer is fixed, the size of numbers is determined
when the computer is built and may not be changed
afterwards. Common integer sizes found in computer
systems are 8, 16, 32, and 64 bits. Therefore, as a
programmer who develops programs running on a given
computer architecture, it makes sense to know what
the precision limit is for the underlying
architecture. This way some precision errors, such
as truncation and rounding errors, may be avoided.
For example, dividing one integer by another in
integer precision yields only the integer part of
the quotient, not the fractional value. If the
fractional value is expected, the program has to be
rewritten in a way that takes care of the precision.
To remedy this, most programming languages offer
type conversions: writing one of the operands as a
floating-point value tells the compiler to allocate
floating-point storage for it and to use
floating-point arithmetic instead of integer
arithmetic.
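
As a minimal sketch of the difference, the Python snippet below contrasts integer division with floating-point division. The operands 7 and 2 are chosen here only for illustration. Python's `//` operator is used to mimic the integer division found in languages such as C or Java, and `float()` plays the role of the type conversion.

```python
a, b = 7, 2                      # illustrative operands

int_quotient = a // b            # integer arithmetic: the fraction is dropped
float_quotient = float(a) / b    # conversion selects floating-point arithmetic

print(int_quotient)              # 3
print(float_quotient)            # 3.5
```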


We will discuss binary arithmetic from the logic
design perspective in the following sections. Quite
often there is more than one logic design for a
binary operation, and the “better one” is the one to
choose. Better here means that the complexity is
manageable and the logic circuit is simple. In the
hardware domain, simple is good, and simple normally
leads to better performance.


Finite Precision Arithmetic 


Consider operations limited to 2-digit non-negative
decimal integers. The numbers are 0, 1, 2, ..., 99,
and the set of them is denoted as $S$. The result of
an operation among the numbers has to be in the set
$S$. Otherwise, the operation is invalid. A valid
operation is therefore defined as follows.


$a \circ b = c$, where $a, b, c \in S$, and $\circ$ is an operator.


So a valid operation maps two numbers in the set to
another number in the set. The operator can be
addition, subtraction, multiplication, division, and
the like. Taking addition as an example, an addition
whose sum is at most 99 is a valid operation because
the result is still in the set. However, an addition
whose sum is 100 is not, because 100 is not in the
set and only its two low-order digits, 00, are. More
precisely, a valid addition over the set is as
follows.


$a + b = c$, where $a, b \in S$ and $a + b < 100$.


Since $a$ and $b$ are non-negative, the result of
adding $a$ and $b$ is also non-negative. Thus, we
only need to limit the result to be less than 100, so
that it is still in the set $S$.
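
The notion of a valid operation can be checked mechanically. The sketch below is only an illustration: the names `S` and `is_valid` are assumptions made here, and the operator is passed as an ordinary two-argument Python function.

```python
import operator

S = range(100)   # the 2-digit non-negative decimal integers 0..99

def is_valid(a, b, op):
    """Return True if both operands and the result of op(a, b) lie in S."""
    if a not in S or b not in S:
        return False
    try:
        result = op(a, b)
    except ZeroDivisionError:
        return False
    return result in S

print(is_valid(12, 34, operator.add))   # True: 46 is in the set
print(is_valid(99, 1, operator.add))    # False: 100 is not in the set
print(is_valid(3, 5, operator.sub))     # False: -2 is not in the set
```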


Another concern is the carry. In the 2-digit
non-negative decimal system, the sum of the digits in
the first (rightmost) position of two numbers may be
larger than 9. In that case we write down the sum
minus 10 as the result for this position, and keep
one as the carry for the next position (to the left).
Table 1 illustrates a decimal addition example with a
carry.


Table 1 An Example of Carry for a Decimal Addition


Addition


In the binary numeral system, the symbol set is
composed of 0 and 1. A positional binary number is
a sequence of 0’s and 1’s. Given an $n$-bit binary
number, $b_{n-1} b_{n-2} \dots b_1 b_0$, the leftmost
bit is called the most significant bit (MSB) because
its weight is $2^{n-1}$, and the rightmost bit is
called the least significant bit (LSB) because its
weight is only $2^0$. Once the binary numbers are
lined up with their weights, i.e., powers of 2, they
may be operated on just as in decimal arithmetic.
The same holds when numbers in another format, such
as hexadecimal, are operated on.


Consider a 4-bit non-negative numeral system. There
are 16 numbers from $0000$ to $1111$, i.e., decimal
0 to 15. So, any valid operation among these numbers
should result in one of them. Also, the carry
mechanism works similarly to that of decimal
addition. For example, Table 2 shows an example of
carry for a binary addition. Compared to the decimal
carry, which carries 10, the binary carry will
“carry” 2 from the first position over to the second
position. When two bits are added, the only possible
carry occurs when both operands are 1, i.e.,
$1 + 1 = 0$ with carry 1.


Table 2 An Example of Carry for a Binary Addition


In general, the carry mechanism applies to other
numeral systems such as octal and hexadecimal. The
value of the carry in an addition is at most one. The
number we write down for a position is calculated by
$s \bmod r$, where $s$ is the sum of the two digits
(plus any incoming carry), and $r$ is the radix or
the base. The value of the carry could be larger than
one in other operations such as multiplication. Table
3 shows an example of carry for an octal addition,
whereas Table 4 gives an example of carry for a
hexadecimal addition.
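
The carry rule is the same in every radix, which the following sketch tries to make concrete. The function name `add_digits` and the digit-list representation (most significant digit first) are assumptions made for this illustration.

```python
def add_digits(x, y, radix):
    """Add two equal-length digit lists (most significant digit first).

    Returns (sum_digits, carry_out), where sum_digits has the same length
    as the inputs and carry_out is the carry leaving the leftmost position.
    """
    carry = 0
    result = []
    for dx, dy in zip(reversed(x), reversed(y)):
        s = dx + dy + carry
        result.append(s % radix)   # digit written down in this position
        carry = s // radix         # 0 or 1 for an addition
    return result[::-1], carry

print(add_digits([5, 8], [2, 7], 10))               # ([8, 5], 0): 58 + 27 = 85
print(add_digits([0, 1, 0, 1], [0, 0, 1, 1], 2))    # ([1, 0, 0, 0], 0): 5 + 3 = 8
print(add_digits([0xA, 0xF], [0x0, 0x1], 16))       # ([11, 0], 0): AF + 01 = B0
```

A non-zero carry out of the leftmost position means the sum does not fit in the given number of digits.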


Table 3 An Example of Carry for an Octal Addition


Table 4 An Example of Carry for a Hexadecimal
Addition


Hardware Adder


Since numbers stored in computers are in binary
format, a hardware adder is built based on binary
inputs. Recall that when adding two binary digits in
each position, a carry may be generated, which is
then added in the position to the left. Based on this
observation, an $n$-bit hardware adder is built from
one-bit adders, each of which takes care of the
one-bit addition for one position. Therefore, we
should build the one-bit adder first and combine $n$
of them together to perform $n$-bit addition.


In hardware design, a very important first step is to
find out what the inputs and outputs are, and how
they are related. The inputs to the one-bit adder
are one bit, say $A$, from one operand, one bit, say
$B$, from the other, and, not to be forgotten, one
carry bit, say $C_{in}$, from the previous position.
The outputs are the sum, say $S$, and the carry bit,
say $C_{out}$. With the inputs and outputs ready, the
next step is to find out their relations. The
following truth table (Table 5) shows the relations.
The truth table is built based on the binary
arithmetic that $0 + 1 = 1$, $1 + 1 = 0$ with carry
1, and $1 + 1 + 1 = 1$ with carry 1.


Table 5 The Truth Table of One-Bit Adder


Most hardware circuits are built based on their truth
table. So it is important to tabulate the truth table
to show the relations of inputs and outputs. The
truth table has two outputs, and they should be
treated separately when simplifying the Boolean
expressions. In other words, one Boolean expression
will represent the relation of the inputs and the sum
$S$; another will represent the relation of the
inputs and the carry out $C_{out}$. With $A$ and $B$
denoting the operand bits and $C_{in}$ the carry in,
the Boolean expressions for the outputs are as
follows:


$$S = \bar{A}\,\bar{B}\,C_{in} + \bar{A}\,B\,\bar{C}_{in} + A\,\bar{B}\,\bar{C}_{in} + A\,B\,C_{in}$$

$$C_{out} = A\,B + A\,C_{in} + B\,C_{in}$$


Note that the expression for $C_{out}$ is the result
of simplification with the Karnaugh map, whereas the
expression for $S$ cannot be simplified in that way.
With the Boolean expressions, the hardware is built
accordingly as shown in Figure 1. Note that the NOT
gate in the diagram is implemented using the INV
component in the library of the Xilinx ISE schematic
design tool.


Figure 1 The Circuit of the Sum of the One-Bit
Hardware Adder
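
The one-bit adder and the way several of them are chained can be modelled in software. The sketch below is an illustration only: the function names `full_adder` and `ripple_add`, and the choice of standard sum-of-products expressions for the sum and the carry out, are assumptions made here and not a description of the circuit in Figure 1.

```python
def full_adder(a, b, cin):
    """One-bit adder built from AND, OR, and NOT; every input is 0 or 1."""
    na, nb, nc = 1 - a, 1 - b, 1 - cin        # NOT gates
    s = (na & nb & cin) | (na & b & nc) | (a & nb & nc) | (a & b & cin)
    cout = (a & b) | (a & cin) | (b & cin)
    return s, cout

def ripple_add(x_bits, y_bits):
    """Add two equal-length bit lists (LSB first) by chaining one-bit adders."""
    carry = 0
    out = []
    for a, b in zip(x_bits, y_bits):
        s, carry = full_adder(a, b, carry)    # carry ripples to the next bit
        out.append(s)
    return out, carry

# 5 + 3 in four bits, least significant bit first.
print(ripple_add([1, 0, 1, 0], [1, 1, 0, 0]))   # ([0, 0, 0, 1], 0), i.e. 8
```

Chaining the carry from one position into the next is what makes this a ripple-carry arrangement; faster adder designs exist but are outside the scope of this sketch.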


The adder circuit can be verified against its truth
table using a simulation tool such as the Xilinx
ISim. The ISim tool simulates the circuit for any
inputs, called stimuli, set by a tester, and reports
the outputs, which are then verified according to the
truth table. To use ISim to simulate your circuit, a
test bench is created. The test bench contains an
instance of the unit under test (UUT) and the
stimuli, a combination of input values. In our case,
the UUT is the adder, and the inputs are the two
operand bits and the carry-in bit. ISim will then
output a waveform as shown in Figure 2. In the first
column of the waveform, the inputs along with the
output signals are listed. On the top of the waveform
is the timeline from 0 to 100,000 in units of
picoseconds. A tester should validate the output
signals at each time instance when the inputs change.
However, this is tedious when there are many inputs
with many combinations. Therefore, ISim provides an
assert statement to automate the verification
process. In the simulation test bench above, an
assert statement automatically checks whether the
carry-out signal is 1 wherever it is supposed to be
1.


Note that this statement is placed right after the
input stimuli are provided. In this case, when two of
the three inputs are 1 and the other is 0, the
carry-out signal is supposed to be 1 according to the
truth table. In case the simulation result shows that
the carry out is 0, the quoted message will be
reported and the tester will be aware of this error
after the simulation is performed. Note also that the
message typically should include the test case
detail. So when this error is found, this particular
test case with the corresponding input values will be
used to debug the original design.
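
The same self-checking idea can be sketched outside a hardware simulator. The Python code below is not ISim or VHDL; it is a software analogue under stated assumptions, in which `full_adder` is a behavioural stand-in for the unit under test and `run_test_bench` is a hypothetical helper that applies every stimulus and asserts on the outputs.

```python
def full_adder(a, b, cin):
    """Behavioural model standing in for the unit under test."""
    total = a + b + cin
    return total & 1, total >> 1      # (sum, carry out)

def run_test_bench(uut):
    """Apply all eight stimuli and check the outputs automatically."""
    checked = 0
    for a in (0, 1):
        for b in (0, 1):
            for cin in (0, 1):
                s, cout = uut(a, b, cin)
                # The message carries the test-case detail for debugging.
                assert s + 2 * cout == a + b + cin, (
                    f"adder failed for a={a}, b={b}, cin={cin}: "
                    f"got sum={s}, cout={cout}")
                checked += 1
    return checked

print(run_test_bench(full_adder))     # 8
```

When a check fails, the assertion message names the exact input combination, which is the information needed to reproduce the failure in the original design.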


Figure 2 The Waveform of the One-Bit Adder from 
the ISim Simulator 


Negative Numbers 


Mathematically, adding a negative number is
equivalent to subtracting the absolute value of the
number. In the two’s complement system, the two’s
complement of a number is equal to the negation of
this number. In a 4-bit signed binary numeral system
using the two’s complement method, the represented
numbers range from -8 to 7.


The two’s complement representation of a negative
number $-x$ in an $n$-bit system is equal to
$2^n - x$. For example, to find the two’s complement
representation of $-x$ in the 4-bit system, we first
find the binary representation of $x$. The binary
number is then converted using the two’s complement
method, i.e., every bit is inverted and one is added,
which gives the 4-bit pattern of $2^4 - x$. Given a
negative number in two’s complement representation,
it is hard to tell what decimal value it represents.
We can first convert it to a positive number using
two’s complement and then convert that to decimal,
but this is a bit tedious. Is there a direct
calculation just like converting a positive binary
number to decimal? The answer is yes. Here is why.


To find the decimal value of a negative number in
two’s complement representation, all we need to do is
calculate the sum of the weighted bits by definition
and subtract $2^n$, where $n$ is the total number of
bits. An example is depicted in Table 6, which
directly calculates the sum as if it were a positive
number, and adds $-2^n$ to it.
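
Both directions of the conversion can be sketched in a few lines. The function names `to_twos_complement` and `from_twos_complement`, and the value -5 used in the usage lines, are chosen here only for illustration.

```python
def to_twos_complement(value, n):
    """Return the n-bit pattern (as a string) that stores value."""
    assert -(1 << (n - 1)) <= value < (1 << (n - 1)), "out of range"
    # A negative value v is stored as the unsigned number 2^n + v.
    return format(value % (1 << n), f"0{n}b")

def from_twos_complement(bits):
    """Sum the weights as if unsigned, then subtract 2^n if the MSB is 1."""
    n = len(bits)
    unsigned = int(bits, 2)
    return unsigned - (1 << n) if bits[0] == "1" else unsigned

print(to_twos_complement(-5, 4))       # 1011, the pattern of 16 - 5 = 11
print(from_twos_complement("1011"))    # -5, computed as 11 - 16
```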


Table 6 An Example of Direct Conversion of a
Negative Binary Number Represented in Two’s
Complement


Overflow


Due to the limited space to store numbers in
computer systems, fixed-precision arithmetic may
result in an invalid operation in which the result
cannot be represented in the system. This is referred
to as overflow. If such a situation happens, computer
hardware will set a flag bit normally found in a
status word (a special register for status reporting
in the CPU). The run-time system, normally the
operating system, will report it to the running
software, and an exception or a trap will be raised
so that a suitable action can be taken.


An overflow occurs when the result of an operation
is larger than the maximal number or smaller than
the minimal number that a system represents. For
addition and subtraction operations, the conditions
for an overflow to occur are as follows:


1. Sum two positive numbers but the result
becomes negative, or

2. Sum two negative numbers but the result
becomes positive.


If we consider subtraction, a positive number minus
a negative number leads to case 1. A negative number
minus a positive number belongs to case 2. On the
other hand, the following cases will not cause
overflow.


1. Sum two numbers of different signs, i.e., one
positive and one negative, or

2. Subtract two numbers of the same sign, i.e., two
positives or two negatives.


The subtraction case includes a positive number
minus a positive number, and a negative number minus
a negative number. In the 4-bit signed system, if we
try to overflow an addition with two numbers of
different signs, we would select the maximal number,
i.e., 7, and the largest negative number, i.e., -1.
Summing the two will not cause overflow. On the other
hand, if we select the smallest positive number and
the smallest negative number, i.e., -8, in an attempt
to underflow the addition, the result is still
representable. Therefore, when summing two numbers
with different signs, there will not be any overflow.
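
The sign rule above translates directly into a check. The sketch below models an $n$-bit signed adder in software; the function name `add_signed` and the returned overflow flag are assumptions made for this illustration, not a description of any particular status register.

```python
def add_signed(a, b, n=4):
    """Add two n-bit signed integers; return (wrapped_result, overflow)."""
    lo, hi = -(1 << (n - 1)), (1 << (n - 1)) - 1
    assert lo <= a <= hi and lo <= b <= hi, "operand out of range"
    raw = (a + b) & ((1 << n) - 1)                   # keep only n bits
    result = raw - (1 << n) if raw >> (n - 1) else raw
    # Overflow: the operands share a sign and the result's sign differs.
    overflow = (a < 0) == (b < 0) and (result < 0) != (a < 0)
    return result, overflow

print(add_signed(3, 5))     # (-8, True): two positives gave a negative
print(add_signed(-8, -1))   # (7, True): two negatives gave a positive
print(add_signed(7, -1))    # (6, False): different signs never overflow
```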


The overflow in fixed-precision arithmetic
invalidates some properties that hold in regular
arithmetic. The associative law is not valid in
fixed-precision arithmetic. For example, in the 4-bit
signed system, an expression such as $(3 + 5) - 4$
is not equivalent to $3 + (5 - 4)$, because adding 3
and 5 will cause an overflow, which invalidates the
operation. Moreover, the distributive law is not
valid, either. For example, an expression such as
$2 \times (5 - 3)$ cannot be rewritten as
$2 \times 5 - 2 \times 3$, for the same reason that
$2 \times 5$ will cause overflow. Note that the
commutative law still applies.


Subtraction 


Subtracting a number from another is equivalent to
adding a negative number. In binary arithmetic, we
first apply two’s complement to the subtrahend, and
then add the two’s complement of the subtrahend to
the minuend. The result will be the difference.


Table 7 An Example of Subtracting 3 from 7 Using
Two's Complement Addition


Using two’s complement for subtraction greatly
simplifies the computer design, as we only need to
build a hardware adder for both addition and
subtraction. Note that a negative number, such as
the negative 3 here, is stored in its two’s
complement format in memory. So there is no need to
build a special hardware circuit for the conversion.
Also, the most significant bit (MSB) yields a carry,
which can be safely ignored because adding two
numbers of different signs will not cause any
overflow.
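
Subtraction by adding the two’s complement can be sketched as follows. The function name `subtract` and its return convention (the $n$-bit result pattern together with the discarded carry) are assumptions made for this illustration.

```python
def subtract(minuend, subtrahend, n=4):
    """Compute minuend - subtrahend using only n-bit addition."""
    mask = (1 << n) - 1
    neg = ((~subtrahend) + 1) & mask     # two's complement: invert, add one
    total = (minuend & mask) + neg       # ordinary unsigned addition
    carry_out = total >> n               # carry out of the MSB, discarded
    return total & mask, carry_out

result, carry = subtract(7, 3)
print(format(result, "04b"), carry)      # 0100 1, i.e. 7 - 3 = 4
```

For 7 - 3 the adder computes 0111 + 1101 = 1 0100; dropping the carry leaves 0100, which is 4.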


Multiplication 


Binary multiplication is similar to decimal
multiplication, and in fact, the binary one is much
easier than the decimal one. Let’s revisit the
multiplication technique learnt in elementary
school. Table 8 shows the technique to multiply 1234
by 4321. Observe that each time the multiplicand is
multiplied by a digit of the multiplier, and the
partial product is written down in line with the
multiplier digit. The product is obtained by adding
up all the partial products. Note that the number of
digits of the product can be as large as the sum of
the numbers of the multiplicand digits and
multiplier digits. In the previous example, the
number of product digits is 7. Had there been a
carry at the leftmost digit, the total number of
product digits would be 8. The doubled size is
normally allowed for in computer design. In a typical
CPU, multiplication is done by a special hardware
circuit not in the ALU. There may be several hardware
multipliers for fast computation.


Table 8 An Example of Decimal Multiplication


In the binary case, the same multiplication
technique applies. Let’s take a look at an example,
and observe how a hardware multiplier can be
implemented. Table 9 illustrates an example of
binary multiplication. Observe that the rows marked
with an arrow are identical to the multiplicand, and
the rest are all zeros. Also, the non-zero rows
correspond to the 1 bits of the multiplier. When
adding up the partial products (all equal to the
multiplicand), they are only shifted left by a number
of positions according to where the multiplier bit
is.


Table 9 An Example of Binary Multiplication


With the above observation, a hardware multiplier
is designed based on an adder and two shifters. The
adder is used to add partial products, whereas the
shifters are used to shift the multiplicand left one
position at a time after the corresponding bit in the
multiplier is checked, and to shift the multiplier
right. Figure 3 illustrates the flowchart of the
hardware multiplication algorithm. In this algorithm,
there are three registers, one for the multiplicand,
one for the multiplier, and one for the product. The
registers for the multiplicand and multiplier are
shift registers. The multiplicand register will shift
left whereas the multiplier register will shift right
in each iteration. If the size of the multiplier is
$n$ bits, the multiplier register needs only $n$
bits. However, the registers for the product and the
multiplicand will have to have $2n$ bits, because of
the shift left operation on the multiplicand register
and the space potentially needed for the product.
These requirements can be improved if we instead
shift the product to the right. Figure 4 shows the
improved version of the multiplication.


Figure 3 The Flowchart of a Hardware Binary
Multiplication Algorithm
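
The shift-and-add procedure can be sketched in software for unsigned operands. The function name `multiply_shift_add` is an assumption made here, and Python integers stand in for the three registers; the flowchart itself is not reproduced.

```python
def multiply_shift_add(multiplicand, multiplier, n=4):
    """Unsigned n-bit multiply: test the multiplier LSB, add, shift, repeat."""
    product = 0                        # models the 2n-bit product register
    for _ in range(n):
        if multiplier & 1:             # current multiplier bit is 1
            product += multiplicand    # add the (shifted) multiplicand
        multiplicand <<= 1             # multiplicand register shifts left
        multiplier >>= 1               # multiplier register shifts right
    return product

print(multiply_shift_add(0b1101, 0b1011))   # 143, i.e. 13 * 11
```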


In the improved version shown in Figure 4, instead
of shifting the multiplicand left, the product
register is shifted to the right in each iteration.
Thus, the shift register for the multiplicand is
saved. However, the adder that adds the multiplicand
to the partial product will have to store the result
in the high half of the product register. Compared to
the previous version, the improved version requires
fewer hardware resources and should perform better,
based on the rule of thumb that “the less, the
better” in hardware design.


Figure 4 An Improved Version of a Hardware
Multiplier
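
A software sketch of the improved scheme is given below for unsigned operands. The function name `multiply_shift_product` is hypothetical, and the sketch assumes that the carry out of the adder is kept as one extra bit above the $2n$-bit product register until the right shift brings it back in.

```python
def multiply_shift_product(multiplicand, multiplier, n=4):
    """Unsigned n-bit multiply in which the product register shifts right."""
    product = 0                            # 2n-bit register plus a carry bit
    for _ in range(n):
        if multiplier & 1:
            product += multiplicand << n   # add into the high half only
        product >>= 1                      # product register shifts right
        multiplier >>= 1                   # multiplier register shifts right
    return product

print(multiply_shift_product(0b1101, 0b1011))   # 143, i.e. 13 * 11
```

The multiplicand is never shifted inside the loop; the constant `multiplicand << n` merely models wiring the multiplicand to the high half of the product register.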


Division


With the experience gained from the hardware
multiplier design in the previous section, binary
division works similarly. Let’s start with an example
of long division using the technique from elementary
school. Table 10 illustrates the long division for
$10110010$ divided by $1010$. In decimal, it is 178
divided by 10. The quotient is 17 and the remainder
is 8. We call 178 the dividend and 10 the divisor.
Their relation is $178 = 10 \times 17 + 8$.


Table 10 An Example of Binary Division Based on
Traditional Long Division


The binary long division is easier than its decimal
version because there are only two choices (0 or 1)
when we guess the quotient digit in each iteration.
If the first digit of the dividend portion is 0, the
quotient digit will be 0. However, if the first digit
of the dividend portion is 1, the quotient digit
could be either 1 or 0. In the example shown in Table
10, most of the time the quotient digit is 1 when the
first digit of the dividend portion is 1. In the row
with an asterisk, however, selecting 1 as the
quotient digit would be wrong because the divisor is
larger than the dividend portion. In that case,
$1001 - 1010$, the result after subtraction will be
negative, i.e., $-1$, because the divisor is 10 but
the dividend portion is 9 in decimal. Therefore,
selecting the quotient digit equal to the first digit
of the dividend portion may fail.


In order to take care of the failure case, we need to
validate the result of the subtraction in each
iteration. Assume both the dividend and the divisor
are positive. Then the quotient and the remainder are
non-negative, and the result of each subtraction
should be non-negative. If the result is negative,
the quotient digit should be 0 and the dividend
portion should remain the same. In the previous
example, in the row with an asterisk, the subtraction
result is $-1$. The quotient digit should be 0 and
the dividend portion should be restored back to
$1001$, not $-1$.


So the binary division algorithm is similar to that
of binary multiplication. The divisor is stored in a
shift register, and it is shifted to the right in
each iteration. Each iteration produces only one bit
of the quotient. To keep the quotient bits, we shift
the quotient register to the left in each iteration
and place the new bit in its LSB. So a shift register
is needed for the quotient. The subtraction can be
implemented using an adder by converting the
subtrahend to its two’s complement. Figure 5 shows
the flowchart of the algorithm.


In the binary division algorithm shown in Figure 5,
two shift registers are needed. One is for the divisor
and the other is for the quotient. Their length is as
long as that of the dividend. Including a register for
the dividend, there are 3 registers in total. Although
the first bit of the dividend portion may not be
equal to the quotient bit, the subtraction result
gives the correct decision. If the subtraction result
is non-negative, the quotient bit is 1, and the
subtraction result is used to update the dividend. On
the other hand, if the subtraction result is negative,
the quotient bit is 0 and the dividend remains intact.
Note that the operation of adding one to the quotient
can be integrated into the shifter for the quotient. A
normal shift left operation will put zero into the
LSB. We could just append one while shifting the
quotient register left.


Figure 5 The Flowchart of a Binary Division
Algorithm


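
The division loop described above can be sketched for unsigned operands as follows. The function name `divide_restoring` is an assumption made here, and Python integers stand in for the dividend, divisor, and quotient registers.

```python
def divide_restoring(dividend, divisor, n=8):
    """Unsigned division of an n-bit dividend; returns (quotient, remainder)."""
    if divisor == 0:
        raise ZeroDivisionError("divisor must be non-zero")
    remainder = dividend
    shifted = divisor << (n - 1)     # divisor aligned with the dividend's MSB
    quotient = 0
    for _ in range(n):
        diff = remainder - shifted   # trial subtraction
        if diff >= 0:
            remainder = diff                  # keep the result
            quotient = (quotient << 1) | 1    # shift left and append 1
        else:
            quotient <<= 1                    # dividend untouched, append 0
        shifted >>= 1                # divisor register shifts right
    return quotient, remainder

print(divide_restoring(0b10110010, 0b1010))   # (17, 8): 178 = 10 * 17 + 8
```
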
Summary 


The binary arithmetic (addition, subtraction,
multiplication, and division) discussed in this
chapter operates in a similar way to its decimal
version. By using two’s complement representation,
negative numbers are represented in such a way that
subtraction can be done by addition. This greatly
reduces the hardware complexity in CPU design.
Therefore, the four binary arithmetic operations
require only adder, shifter, and two’s complement
hardware components.


Due to fixed-precision arithmetic, i.e., numbers
being stored in a fixed-length register, programmers
and hardware designers must be aware of the limit
of the fixed precision. When a number is beyond the
maximal number or below the minimal number that a
fixed-length register can represent, an overflow or
an underflow occurs. When that happens, the
computation result is incorrect and should be
discarded. This is typically a case where no result
is better than a wrong result.


Booth’s Algorithm 


Booth’s algorithm gives a procedure for multiplying
binary integers in signed 2’s complement
representation in an efficient way, i.e., with fewer
additions and subtractions in general. It relies on
the fact that strings of 0’s in the multiplier
require no addition but just shifting, and that a
string of 1’s in the multiplier from bit weight
$2^k$ down to weight $2^m$ can be treated as
$2^{k+1} - 2^m$.


As in all multiplication schemes, Booth’s algorithm
requires examination of the multiplier bits and
shifting of the partial product. Prior to the
shifting, the multiplicand may be added to the
partial product, subtracted from the partial product,
or left unchanged according to the following rules:


1. The multiplicand is subtracted from the partial
product upon encountering the first (least
significant) 1 in a string of 1’s in the multiplier.

2. The multiplicand is added to the partial
product upon encountering the first 0
(provided that there was a previous 1) in a
string of 0’s in the multiplier.

3. The partial product does not change when the
multiplier bit is identical to the previous
multiplier bit.


Booth’s Algorithm Flowchart 


Booth’s algorithm can be described using the
following flowchart. We name the register that keeps
the partial product A, the multiplicand M, the
multiplier Q, the extra bit attached to the right of
$Q_0$ (the least significant bit of the multiplier)
$Q_{-1}$, and the counter Count. The flowchart for
Booth’s algorithm is shown below.


[Flowchart: A ← 0, $Q_{-1}$ ← 0, M ← Multiplicand,
Q ← Multiplier, Count ← n; in each iteration:
Arithmetic Shift Right: A, Q, $Q_{-1}$;
Count ← Count - 1]


A and the appended bit $Q_{-1}$ are initially cleared
to 0, and the sequence counter Count is set to a
number $n$ equal to the number of bits in the
multiplier. The two bits of the multiplier in $Q_0$
and $Q_{-1}$ are inspected. If the two bits are equal
to 10, it means that the first 1 in a string of 1’s
has been encountered. This requires subtraction of
the multiplicand from the partial product in A. If
the two bits are equal to 01, it means that the first
0 in a string of 0’s has been encountered. This
requires the addition of the multiplicand to the
partial product in A.


When the two bits are equal, the partial product
does not change. An overflow cannot occur because
the additions and subtractions of the multiplicand
alternate. As a consequence, the two numbers that
are added always have opposite signs, a condition
that excludes an overflow. The next step is to shift
the partial product and the multiplier (including
$Q_{-1}$) to the right. This is an arithmetic shift
right (ashr) operation, in which A, Q, and $Q_{-1}$
are shifted to the right as one unit and the sign bit
in A is left unchanged. The sequence counter Count is
decremented, and the computational loop is repeated
$n$ times.
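
The loop described above can be sketched in software. The function name `booth_multiply` and the use of Python integers masked to $n$ bits in place of the registers A, Q, and M are assumptions made for this illustration. The sketch also assumes that the multiplicand is not the most negative $n$-bit value, a case that needs one extra bit in A.

```python
def booth_multiply(multiplicand, multiplier, n=4):
    """Multiply two n-bit two's complement integers with Booth's algorithm.

    Assumes the multiplicand is not -2**(n-1).
    """
    mask = (1 << n) - 1
    a, q, q_1 = 0, multiplier & mask, 0
    m = multiplicand & mask
    for _ in range(n):
        pair = ((q & 1) << 1) | q_1
        if pair == 0b10:                      # first 1 of a string of 1's
            a = (a - m) & mask
        elif pair == 0b01:                    # first 0 after a string of 1's
            a = (a + m) & mask
        # Arithmetic shift right of the combined A, Q, Q_-1 register.
        q_1 = q & 1
        q = ((q >> 1) | ((a & 1) << (n - 1))) & mask
        a = (a >> 1) | (a & (1 << (n - 1)))   # replicate the sign bit of A
    raw = (a << n) | q                        # 2n-bit product held in A:Q
    return raw - (1 << (2 * n)) if raw >> (2 * n - 1) else raw

print(booth_multiply(-3, -7))   # 21
```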


Examples 


Let’s compute $-3 \times -7$ using Booth’s
algorithm. The 2’s complement representations of
these numbers are 1101 and 1001, respectively. So
initially A = 0000, Q = 1001, M = 1101, M’ = 0010
(the bitwise complement of M, so that $-M$ is
M’ + 1 = 0011), and $Q_{-1}$ = 0. The final result is
21, which is stored in A and Q as 0001 0101.


Why Booth’s Algorithm Works


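Consider a positive multiplier consisting of a block
of 1’s surrounded by 0’s, for example, 00111110.
The product $M \times 00111110$ is given by
$M \times (2^1 + 2^2 + 2^3 + 2^4 + 2^5)$, which is
$M \times (2^6 - 2^1)$. So the total number of
operations is reduced to two. Given M as a
multiplicand and Q as a multiplier, with
$Q_{-1} = 0$, we have the following:


$$
\begin{aligned}
& M \times \left[(Q_{-1} - Q_0)\,2^0 + (Q_0 - Q_1)\,2^1 + \dots + (Q_{n-2} - Q_{n-1})\,2^{n-1}\right] \\
&= M \times \left[(Q_{-1} - Q_0)\,2^0 + 2(Q_0 - Q_1)\,2^0 + 2(Q_1 - Q_2)\,2^1 + \dots + 2(Q_{n-2} - Q_{n-1})\,2^{n-2}\right] \\
&= M \times \left(Q_0\,2^0 + Q_1\,2^1 + \dots + Q_{n-2}\,2^{n-2} - Q_{n-1}\,2^{n-1}\right) \\
&= M \times Q
\end{aligned}
$$


Note that a number Q in 2’s complement representation
has the value
$Q_0\,2^0 + Q_1\,2^1 + \dots + Q_{n-2}\,2^{n-2} - Q_{n-1}\,2^{n-1}$.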


Number Systems 

This chapter gives an overview of how different
number systems (binary, octal, decimal, and
hexadecimal) are defined and how to convert from one
system to another. It explains the machine-level
representation of data; introduces bits, bytes,
words, two’s complement representation of numbers,
records, and arrays; presents a type as a set of
values together with a set of operations; shows
primitive types (e.g., numbers, Booleans); compares
representations of integers with those of
floating-point numbers; and describes underflow,
overflow, round-off, and truncation errors in data
representations.


Numeral Systems 


The notion of “number” stems from counting. A
number is used to represent a count of objects of the
same kind. For example, a pile of pebbles can be used
to represent a number. Therefore, starting from a
pile of nothing, which represents the first number,
the numeral system would be built by adding more
pebbles one at a time. By doing so, the number
system is composed of a pile of nothing, a pile of
one pebble, a pile of two pebbles, and the like.
However, it is inconvenient to describe numbers in
this way. Instead, symbols such as 0, 1, 2, 3, 4, 5,
6, 7, 8, 9, 10, etc., are introduced to represent
these numbers as Arabic numbers, and symbols such as
I, II, III, IV, V, VI, VII, VIII, IX, X, etc., are
used as Roman numbers. Note that the Roman numeral
system does not include zero.


Positional Notation 


Every numeral system uses symbols to convey
information about the value of a number. However,
for a larger number, it is hard to find a specific
symbol to represent it. Therefore, the actual value
of a symbol is designed to depend on its position.
Positional notation, or place-value notation, is a
method of representing or encoding numbers in this
way. Arabic numerals are positional whereas Roman
numerals are not. In the Roman numeral system, X
means 10 no matter where in a number it appears.
The Arabic number 10 contains two symbols: 1 and
0. Since it is positional, the first symbol 1 has the
value 10 because it is in the tens place. The second
symbol 0 has the value 0 because it is in the ones
place.


Base of a Numeral System 


In mathematics, the base or the radix is the number
on which a positional numeral system is built: each
position is worth that number times the position to
its right. The Arabic numeral system is a base-10
(decimal) system. Along with the positional
notation, for example, the value of the number 1234
is determined as follows:


$1234 = 1 \times 10^3 + 2 \times 10^2 + 3 \times 10^1 + 4 \times 10^0$


From right to left, 4 is called the ones digit, 3
the tens digit, 2 the hundreds digit, and 1 the
thousands digit. The number 10 is the base or the
radix of the decimal system. In general, we could
choose any number as the radix for a numeral system.
However, the radix normally is an integer larger
than one. Let $r$ be the radix of a numeral system.
The value of a number is defined as


$a_{n-1} a_{n-2} a_{n-3} \dots a_2 a_1 a_0 = \sum_{i=0}^{n-1} a_i \times r^i = a_0 \times r^0 + a_1 \times r^1 + \dots + a_{n-1} \times r^{n-1}$.


Note that any non-zero number to the power of zero is
one. So the first term is $a_0 \times r^0 = a_0$.
Some of the digits may be zero, say $a_k$, and such a
digit does not contribute to the total value of the
number because $a_k \times r^k = 0 \times r^k = 0$.
However, it still has to be there as it serves as a
placeholder. Numbers represented in this way are
called $r$-ary numbers. With the base and positional
notation, all non-negative integers can be
represented in the decimal system.
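
The definition of positional value can be evaluated directly. In the sketch below, the function name `positional_value` and the digit-list representation (most significant digit first) are assumptions made for this illustration.

```python
def positional_value(digits, radix):
    """Value of the digit list a_{n-1} ... a_1 a_0 in the given radix."""
    assert radix > 1 and all(0 <= d < radix for d in digits)
    # Pair each digit with the exponent of its position, counted from the right.
    return sum(d * radix ** i for i, d in enumerate(reversed(digits)))

print(positional_value([1, 2, 3, 4], 10))   # 1234
print(positional_value([1, 0, 0, 0], 2))    # 8
```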


Size of Symbols Required in a Numeral 
System 


It is obvious that the number of symbols used in a
non-positional numeral system is infinite. The
reason is that each number in the system would have
to have a specific symbol of its own. In a positional
numeral system, however, the number of symbols is
equal to the base. In the decimal numeral system
(base-10), only 10 symbols are required. Should we
add one more symbol, say “A”, to represent “10”, we
would end up with two different notations “A” and
“10” for the number 10. Also, “A0” is a duplicate for
“100”. Since the number $r$, where $r$ is the base,
can always be represented by 10
($1 \times r^1 + 0 \times r^0$), there is no need to
have an extra symbol for it. This fact applies to
digits in other positions. Therefore, the number of
symbols in a positional numeral system is equal to
its base.


Given a set of $r$ symbols, to form a positional
numeral system, its base has to be less than or equal
to $r$. When the base is less than $r$, some numbers
may not have unique notations. It is therefore
convenient to choose the base that is equal to the
number of symbols.


Negative Integers 


The negative integers are represented by adding a
minus symbol as a prefix to a number. For example,
-1234 denotes negative one thousand two hundred
and thirty-four. Note that the symbols used in the
decimal numeral system now include minus, in
addition to the Arabic numbers 0 to 9.


Real Numbers 


A real number is composed of an integral part and a
fractional part. The integral part can be represented
using the decimal numeral system established so far.
In order to represent the fractional part, the period
“.” symbol is added. The digit immediately to the
right of the period denotes one-tenth ($10^{-1}$),
followed by one-hundredth ($10^{-2}$), and the like.
A real number in the decimal numeral system is
defined as follows:


$a_{n-1} a_{n-2} \dots a_1 a_0 . b_1 b_2 \dots b_m = \sum_{i=0}^{n-1} a_i \times 10^i + \sum_{j=1}^{m} b_j \times 10^{-j}$.


For example,
$12.34 = 1 \times 10^1 + 2 \times 10^0 + 3 \times 10^{-1} + 4 \times 10^{-2}$.
This encoding system can represent many real numbers.
However, some real numbers such as $1/7$ cannot be
represented with a finite number of digits because
$1/7 = 0.142857142857\ldots$ is a repeating decimal.
Nevertheless, an approximation that is close enough
would be practically useful for most applications.


Numeral Systems in Computer Systems 


Even though any numeral system could be used to 
represent numbers used in internal computer 
systems, the binary numeral system is adopted 
because of the underlying digital architecture. In the 
following, the binary numeral system is introduced, 
followed by octal and hexadecimal numbers with 
their translations. 


Binary Numerals 


In digital systems, 0 and 1 are used to represent off
and on, respectively. Originally, “on” refers to
electrical charge and “off” indicates no electrical
charge. Regardless of the exact amount of charge, if
the amount is larger than a threshold (called the
logic level), it is in the “on” state, also called
logic high. On the other hand, if the amount is below
the threshold, it is in the “off” state, also called
logic low.


Since there are only two symbols (0 and 1), to form
a positional numeral system we choose the base
equal to 2, i.e., the number of symbols. Because the
base is two, the system is called the binary numeral
system. For example, counting from zero up in
binary numerals results in the following series:


0, 1, 10, 11, 100, 101, 110, 111, 1000, 1001, 1010, 1011, ...


This series represents the decimal numbers
0, 1, 2, 3, ..., 11, ... Each digit in a binary
number is called a bit, and each bit requires a
hardware component that can hold the charge
(information) or release it. Normally, the base is
clear from context in the above representations.
To make it explicit, the base can be written as a
subscript: $1000_2$ represents the decimal number
$8_{10}$, not $1000_{10}$. In a later section, we will
study how a binary number is converted to an
equivalent decimal number, and vice versa.
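The counting series and the subscript notation can be checked with a minimal Python sketch. It relies only on the built-ins `format` and `int`; the variable names are illustrative and not part of the text.

```python
# Count from 0 to 11 and show each value in binary, as in the series above.
series = [format(n, "b") for n in range(12)]
print(",".join(series))

# The same digit string denotes different values in different bases:
# int(text, base) interprets the string in the given base.
as_binary = int("1000", 2)     # the value of 1000 read in base 2
as_decimal = int("1000", 10)   # the value of 1000 read in base 10
print(as_binary, as_decimal)
```

The digit string alone is ambiguous; it is the base passed to `int` that plays the role of the subscript.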


Range of Binary Numbers 


Computers are designed to keep and manipulate
a number in a place of “fixed” length. Unlike writing
a number on a piece of paper, where you can keep
adding digits as long as there is space, a computer
can dedicate only a fixed space to it. This
fixed-length space is called a register. The length
of a register is determined when the computer is
designed. Normally, there are 1-bit, 2-bit, 4-bit,
8-bit, 16-bit, 32-bit, 64-bit registers, and so on.
Since each bit can only be either one or zero, a
register is limited to representing a subset of
numbers. A 1-bit register can only represent 0 or 1;
a 2-bit register can represent 00, 01, 10, and 11,
four possible combinations. In general, an n-bit
register can represent $2^n$ possible numbers. The
table below lists the possible number representations
on a register of fixed length for non-negative
numbers only. For an n-bit register, the unsigned
integers range from 0 to $2^n - 1$.


Table Possible Number Representations on a
Register of Fixed Length


Number of bits    Range for Unsigned
1                 0 to 1
2                 0 to 3
4                 0 to 15
8                 0 to 255
16                0 to 65535
32                0 to 4294967295
64                0 to 18446744073709551615
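The unsigned ranges follow directly from the formula $2^n - 1$. The sketch below recomputes them; `unsigned_range` is a hypothetical helper name chosen for illustration.

```python
def unsigned_range(bits: int) -> tuple[int, int]:
    """Return (smallest, largest) unsigned value an n-bit register can hold."""
    return 0, 2 ** bits - 1


for bits in (1, 2, 4, 8, 16, 32, 64):
    low, high = unsigned_range(bits)
    print(f"{bits:>2} bits: {low} to {high}")
```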


Negative Numbers Using Sign-And-Magnitude, Ones’
Complement, and Two’s Complement


As in the decimal numeral system, we can dedicate
one bit to indicate whether a minus sign is present.
This bit is typically the leading bit and is called
the sign bit. Positive numbers are represented with
the sign bit equal to zero; a sign bit of one
indicates a negative number. This approach is called
the “sign-and-magnitude” method. The numbers
represented in a 4-bit sign-and-magnitude example
are shown in the table below. Given a total of n
bits, the range of the numbers represented by this
method is from $-(2^{n-1}-1)$ to $2^{n-1}-1$.


Table The 4-Bit Numbers Represented by the Several 
Methods 

Early computer systems (e.g., IBM 7090) and
floating-point formats adopt this method because it
is natural to read. Its problem, however, is that
there are two notations for zero: positive zero
and negative zero. Since zero testing is used
extensively in computing, the method may not be
suitable, and other methods are needed for
representing negative numbers in the binary
numeral system.


Ones’ complement can also be used to represent
negative numbers. If the leading bit is zero, the
number is positive, as in the sign-and-magnitude
method. When the leading bit is one, the absolute
value of the number is obtained by inverting every
bit, one to zero and zero to one. This process is
called bitwise NOT, or ones’ complement. The same
table lists the 4-bit numbers represented by the
ones’ complement method. Given a total of n bits,
the range of the numbers represented by this method
is from $-(2^{n-1}-1)$ to $2^{n-1}-1$. Like the
sign-and-magnitude method, the ones’ complement
method has the problem that zero is represented in
two different ways. The ones’ complement method was
commonly used in older machines such as the PDP-1.
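The duplicate-zero problem of both methods can be seen by decoding every 4-bit pattern. The following is a minimal sketch; the function names are hypothetical, and Python integers have no negative zero, so both zero patterns simply decode to 0.

```python
def decode_sign_magnitude(pattern: int, bits: int = 4) -> int:
    """Interpret an n-bit pattern as sign-and-magnitude."""
    sign = pattern >> (bits - 1)                      # leading bit
    magnitude = pattern & ((1 << (bits - 1)) - 1)     # remaining bits
    return -magnitude if sign else magnitude


def decode_ones_complement(pattern: int, bits: int = 4) -> int:
    """Interpret an n-bit pattern as ones' complement."""
    mask = (1 << bits) - 1
    if pattern >> (bits - 1):           # leading bit is 1: negative number
        return -((~pattern) & mask)     # invert every bit to get the magnitude
    return pattern


for pattern in range(16):
    print(format(pattern, "04b"),
          decode_sign_magnitude(pattern),
          decode_ones_complement(pattern))
```

In the printed listing, 0000 and 1000 both decode to zero under sign-and-magnitude, while 0000 and 1111 both decode to zero under ones’ complement.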


Two’s complement is designed to circumvent the
duplicate zero problem found in the previous
methods. In this method, a negative number is
represented by the ones’ complement of its magnitude
plus one. In a sense, the negative numbers are
shifted down by one: the bit pattern that is
negative zero in ones’ complement becomes -1 in the
two’s complement method, as shown in the table
above. To find the two’s complement representation
of a negative number, follow these steps:


1. Convert the positive number to binary,
2. Find its ones’ complement, and
3. Add one to the result.


For example, to find the two’s complement 
representation of -6: 


1. Convert 6 to 0110, 
2. Ones’ complement is 1001, and 
3. Add one to the result: 1010. 


Thus, the two’s complement representation of -6 is
1010. Note that the leading bit still represents the
sign. Also, given the two’s complement
representation of a negative number, we can apply
two’s complement to it again to get its absolute
value. In the above example, if we follow the two’s
complement steps on 1010 (-6), we get 0110 (6)
again. This is better explained by the following
equations. Let $x^{\circ}$ be the ones’ complement of
$x$, an n-bit binary number, and $x'$ be the two’s
complement of $x$.


$$
\begin{aligned}
x &= a_0 \times 2^0 + a_1 \times 2^1 + \ldots + a_{n-1} \times 2^{n-1} \\
x^{\circ} &= (1-a_0) \times 2^0 + (1-a_1) \times 2^1 + \ldots + (1-a_{n-1}) \times 2^{n-1} \\
&= (2^0 + 2^1 + \ldots + 2^{n-1}) - (a_0 \times 2^0 + a_1 \times 2^1 + \ldots + a_{n-1} \times 2^{n-1}) \\
&= (2^n - 1) - x \\
x' &= (2^n - 1) - x + 1 = 2^n - x \\
(x')' &= 2^n - (2^n - x) = x
\end{aligned}
$$


The two’s complement of an n-bit binary number $x$
is $2^n - x$, whereas its ones’ complement is
$2^n - x - 1$. This also explains why we add one to
the ones’ complement in step 3 of the process of
finding the two’s complement of a number.
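The three steps and the identity $x' = 2^n - x$ can be sketched in Python as follows. The function name `twos_complement` is chosen here for illustration, and the input is assumed to be an n-bit pattern given as a non-negative integer.

```python
def twos_complement(x: int, bits: int) -> int:
    """Return the n-bit two's complement of x (0 <= x < 2**bits)."""
    mask = (1 << bits) - 1
    ones = ~x & mask            # step 2: ones' complement, equal to 2**bits - 1 - x
    return (ones + 1) & mask    # step 3: add one; the mask drops any carry out


neg_six = twos_complement(0b0110, 4)
print(format(neg_six, "04b"))                       # pattern for -6
print(format(twos_complement(neg_six, 4), "04b"))   # applying it again restores 6
```

Applying the function twice returns the original pattern, which mirrors the last line of the derivation.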


In practice, there is an easier way to find the two’s
complement. Find the first 1 from the right, keep it
and the zeros to its right unchanged, and flip every
digit to its left; the result is the two’s complement.
The table below illustrates why, by applying the
aforementioned steps to calculate a two’s complement.
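The shortcut can be sketched on a string of bits as follows. `twos_complement_shortcut` is a hypothetical helper name, and the input is assumed to contain only the characters 0 and 1.

```python
def twos_complement_shortcut(bit_string: str) -> str:
    """Keep the rightmost 1 and everything to its right; flip the rest."""
    pivot = bit_string.rfind("1")
    if pivot == -1:                  # all zeros: the two's complement is unchanged
        return bit_string
    flipped = "".join("1" if b == "0" else "0" for b in bit_string[:pivot])
    return flipped + bit_string[pivot:]


print(twos_complement_shortcut("011100"))   # the 6-bit pattern for 28
```

For every 6-bit pattern the shortcut agrees with computing $2^6 - x$ directly, which is what the checks accompanying this sketch verify.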


Table An Easy Way to Find the Two's Complement
of a Number


Step                            Value     Remark
Value to be converted           28
Step 1. Convert 28 to binary    011100    A later section describes the conversion.
Step 2. Find ones' complement   100011    Just flip 0 to 1, and 1 to 0.
Step 3. Add one to it           100100    Observe that the trailing zeros in
                                          step 1 will still be zeros after
                                          adding one. The first one from the
                                          right in step 1 will still be 1
                                          after adding one. The digits to the
                                          left of it are now flipped because
                                          of step 2.


Addition


Adding two binary numbers is similar to summing
two decimal numbers. For positive numbers, the
three methods (sign-and-magnitude, ones’
complement, and two’s complement) work the same
way. However, they differ slightly when adding
negative numbers. The table below shows the results
of adding two 4-bit positive numbers. The results
are obtained using a technique similar to the one
applied to decimal addition. Note that carry digits
are shown in the table. Working from right to left,
if there are two 1’s to be added, write down the
result 0 and carry 1 to the next position; if there
are three 1’s to be added, write down the result 1
and carry 1 to the next position.


Table Adding Two 4-Bit Positive Numbers


[missing resource: table with columns Carry, Decimal, Sign-and-Magnitude, One's Complement, Two's Complement]


Additions with negative numbers are performed
slightly differently in the three methods, as shown
in the table below. First, decimal arithmetic and the
sign-and-magnitude method change the addition into
a subtraction because they keep the absolute value
of each number. Second, the ones’ complement and
two’s complement methods can do the addition
directly. However, the ones’ complement method
requires adding one at the end to compensate for an
off-by-one error. Let $x$ and $y$ be two positive
numbers. The ones’ complement and two’s complement
representations of $-y$ are denoted $y^{\circ}$ and
$y'$, respectively. Therefore, we have the following:


$$x + y^{\circ} = x + (2^n - y - 1) = x - y + 2^n - 1$$


Once the term $2^n$ is set aside, the result of
directly summing a positive and a negative number
in ones’ complement representation is $x - y - 1$.
This is why we have to add one back in the ones’
complement method. Note that the term $2^n$ is the
carry out of the highest bit and can be safely
removed. In the two’s complement method, the
off-by-one error is resolved up front, when the
number is converted to its two’s complement
representation. The mathematical relation is as
follows:


$$x + y' = x + (2^n - y) = x - y + 2^n$$


After the carry $2^n$ is dropped, the result of
adding a positive number and a negative number in
two’s complement representation is $x - y$, i.e.,
the result of subtracting the magnitude of the
negative number from the positive number, with no
correction step. This is an advantage of the two’s
complement method over the ones’ complement method.
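The difference between the two methods can be sketched for 4-bit patterns as follows. The names `BITS`, `MASK`, `add_ones_complement`, and `add_twos_complement` are illustrative assumptions; operands are bit patterns given as integers from 0 to 15.

```python
BITS = 4
MASK = (1 << BITS) - 1


def add_ones_complement(a: int, b: int) -> int:
    """Add two 4-bit ones' complement patterns, adding one back on carry out."""
    total = a + b
    if total > MASK:                  # a carry left the highest bit
        total = (total & MASK) + 1    # drop the carry (2**n) and add one back
    return total & MASK


def add_twos_complement(a: int, b: int) -> int:
    """Add two 4-bit two's complement patterns; the carry out is discarded."""
    return (a + b) & MASK


# 4 + (-2): -2 is 1101 in ones' complement and 1110 in two's complement.
print(format(add_ones_complement(0b0100, 0b1101), "04b"))
print(format(add_twos_complement(0b0100, 0b1110), "04b"))
```

When no carry leaves the highest bit, the ones’ complement sum is already a valid pattern, which is why the correction in this sketch is applied only when a carry occurs.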


Table Additions with Negative Numbers


         Decimal   Sign-and-Magnitude   One's Complement   Two's Complement
Carry    0         0000                 11000              11000
         4         0100                 0100               0100
         -2        -1010                +1101              +1110
         2         0010                 (0001 + 1)         0010
                                        = 0010
                                        (Add one back)


Subtraction


As the addition section shows, the ones’ complement
and two’s complement methods allow direct addition
even if the numbers are negative. Therefore,
subtraction in these two methods is
straightforward: simply negate the number to be
subtracted and perform an addition, as follows:


$$x - y = x + (-y) = x + y' = x + y^{\circ} + 1$$


The table below gives an example of subtracting 3
from 7. The benefit of doing subtraction using
addition is that only an adder hardware component
is required in the CPU; no specific subtraction
hardware is needed. This results in a smaller
hardware footprint and simplifies the design and
testing of the hardware because of reusability. A
component that is designed and thoroughly tested
once can be used elsewhere, saving time and cost.
Most importantly, the system will be robust and
easy to maintain. Note that the carry shown in the
table is for the first addition only.
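Subtraction by addition can be sketched as follows, assuming 4-bit two’s complement patterns. The function name `subtract` is hypothetical; only inversion, addition, and masking are used, as an adder-based design would.

```python
def subtract(x: int, y: int, bits: int = 4) -> int:
    """Compute x - y using only bit inversion and addition."""
    mask = (1 << bits) - 1
    negated_y = (~y & mask) + 1      # two's complement of y: ones' complement plus one
    return (x + negated_y) & mask    # discard the carry out of the highest bit


print(format(subtract(0b0111, 0b0011), "04b"))   # 7 - 3
print(format(subtract(0b0011, 0b0111), "04b"))   # 3 - 7, a negative result
```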


Table Subtraction on Two 4-Bit Numbers


[missing resource: table with columns Carry, Decimal, Sign-and-Magnitude, One's Complement, Two's Complement; the surviving entry is the result 0100]


Subtraction of decimal numbers and subtraction in
the sign-and-magnitude method are similar to each
other. The decimal arithmetic technique applies, but
before the process starts, the numbers must be
compared to decide the sign of the result and to
switch the positions of the two numbers if necessary.
For example, given two positive numbers $x$ and $y$,
the result of $x - y$ is $-(y - x)$ if $y > x$. The
cases to consider for any two numbers are shown in
the table below. When the two numbers have different
signs, i.e., one is positive and the other is
negative, the subtraction actually adds their
absolute values, and the sign of the result is that
of the first number. These representations, though
natural, complicate the hardware design for
subtraction: they require much more case analysis
than the two’s complement and ones’ complement
methods, and they involve additions as well.


Table Subtraction on Decimal Numbers and Sign- 
and-Magnitude Methods. P indicates a positive 
number and N denotes a negative number. T stands 
for true and F represents false. 
Overflow


In the 4-bit binary number system using the two’s
complement method, the numbers represented range
from -8 to 7. Adding two numbers may produce a
result outside this range, i.e., one that cannot be
represented in the 4-bit space. In this case, an
overflow occurs and the result is wrong. The
computer should flag an error, and no further
computation should proceed with the result. For
example, 3 + 6 = 9 > 7: the sum of 3 and 6 is 9,
which is larger than the maximal number a 4-bit
register can represent, i.e., 7, if two’s complement
is used. On the other hand, -3 + (-6) = -9 < -8: the
result falls outside the range -8 to 7 and cannot be
represented. This case is called underflow, i.e., a
number that is less than the smallest possible
value the 4-bit register can represent. However,
summing two numbers of different signs never causes
overflow, and neither does subtracting a number
from another of the same sign. It is important to
find out when overflow or underflow occurs. Since
we can use two’s complement for subtraction, only
additions of two numbers are considered here. In
the following discussion, assume the two’s
complement method is used to represent negative
numbers. The table below shows the overflow
condition, with examples, for summing two 4-bit
numbers a and b represented by the two's complement
method.


Table Overflow Condition for Summing Two 4-Bit
Numbers Represented by Two's Complement


Sign a   Sign b   Sign result   Overflow   Example
0        0        0             F
0        0        1             T          0011+0110
0        1        0             F
0        1        1             F
1        0        0             F
1        0        1             F
1        1        0             T          1101+1010
1        1        1             F


The result shows that overflow depends on the sign
bits of the two numbers and of the result. An
overflow occurs when summing two positive numbers
gives a negative result, or when summing two
negative numbers gives a positive result. The
Boolean expression for the overflow is as follows:


$$\text{Overflow} = \overline{\text{Sign}_a} \cdot \overline{\text{Sign}_b} \cdot \text{Sign}_{result} + \text{Sign}_a \cdot \text{Sign}_b \cdot \overline{\text{Sign}_{result}}$$


Note that this Boolean expression cannot be
simplified further using Karnaugh maps.
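The overflow condition can be sketched directly from the sign bits. `add_with_overflow` is a hypothetical helper, and the operands are assumed to be n-bit two’s complement patterns given as non-negative integers.

```python
def add_with_overflow(a: int, b: int, bits: int = 4) -> tuple[int, bool]:
    """Add two n-bit two's complement patterns; return (sum pattern, overflow flag)."""
    mask = (1 << bits) - 1
    result = (a + b) & mask
    sign_a = (a >> (bits - 1)) & 1
    sign_b = (b >> (bits - 1)) & 1
    sign_r = (result >> (bits - 1)) & 1
    # Overflow: both operands positive but result negative,
    # or both operands negative but result positive.
    overflow = bool((not sign_a and not sign_b and sign_r)
                    or (sign_a and sign_b and not sign_r))
    return result, overflow


print(add_with_overflow(0b0011, 0b0110))   # 3 + 6
print(add_with_overflow(0b1101, 0b1010))   # -3 + (-6)
print(add_with_overflow(0b0100, 0b1110))   # 4 + (-2)
```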


Real Numbers 


In decimal numeral representation, the integral part
and the fraction part are separated by a period, and
arithmetic operations require lining up the decimal
points. Likewise, the fraction in a binary numeral
is separated by a period (the “binary point”) in
written form. The difference is that each fraction
digit in decimal is weighted by a negative power of
10, whereas in binary it is weighted by a negative
power of 2. As with decimal fractions, not all
fractions can be represented exactly, but we can get
arbitrarily close by using more fraction bits.


For example, adding the decimal real numbers
3.25 + 1.5 = 4.75 in binary format, i.e.,
11.01 + 1.10 = 100.11, is illustrated in the table
below. Once the decimal or binary points are lined
up, the normal calculation technique applies. A
carry in the fraction part works the same way as in
the integral part. This can be verified easily by
$(1+1) \times 2^{-i} = 2 \times 2^{-i} = 2^{-(i-1)}$,
so any carry from the current position is added to
the position on its left.


Table Adding Two Binary Real Numbers 
Octal Numerals


If the set of symbols {0,1,2,3,4,5,6,7} is chosen and
the base is equal to 8, an octal numeral system is
formed. For example, $13_8 = 11_{10}$ because
$13_8 = 1 \times 8^1 + 3 \times 8^0 = 11_{10}$. This
system is a base-8 numeral system. In Unix operating
systems, octal numbers are used to represent file
permissions. An octal number 750 associated with a
Unix file means that the owner of the file is
allowed to read, write, and execute it, group users
are allowed to read and execute it, and others are
not permitted to do anything with the file.
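The reading of a permission value such as 750 can be sketched as follows: each octal digit expands to three bits, which stand for read, write, and execute in that order. `describe_mode` is a hypothetical helper, not a system call.

```python
def describe_mode(octal_text: str) -> list[str]:
    """Turn an octal mode such as '750' into rwx strings for owner, group, others."""
    flags = "rwx"
    result = []
    for digit in octal_text:
        bits = format(int(digit, 8), "03b")    # each octal digit is three bits
        result.append("".join(f if b == "1" else "-" for f, b in zip(flags, bits)))
    return result


print(describe_mode("750"))
```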


The advantages of octal numerals are that 1) their
symbols are drawn from the Arabic numerals, 2) they
can be converted to and from binary numerals
quickly, and 3) they are an ideal abbreviation for
binary numbers. A later section is devoted to the
conversion. Early machines like IBM mainframes
employed 24-bit or 36-bit words. Since each octal
digit corresponds to 3 binary bits, only 8 octal
digits (12 octal digits) are required to display a
24-bit (36-bit) word. The number of seven-segment
displays needed for a console is thereby greatly
reduced.

[missing resource: draw_odg0.png] 


Hexadecimal Numerals 


Hexadecimal is a base-16 positional numeral system.
The set of symbols used is {0-9, A, B, C, D, E, F} or
{0-9, a, b, c, d, e, f}, and the radix or base is 16.
The introduction of letters, though potentially
confusing, is necessary because hexadecimal numerals
require 16 different symbols. The letters denote the
numbers 10 (A) to 15 (F), respectively. For example,
$A1_{16} = A \times 16^1 + 1 \times 16^0 = 10 \times 16^1 + 1 \times 16^0 = 161_{10}$.


Hexadecimal was introduced primarily as a
human-friendly representation of binary-coded values
in computing and digital electronics. A byte (8
bits) represents values from 0 to 255, which is
conveniently written as two hexadecimal digits
ranging from 00 to FF. Therefore, hexadecimal
numerals are used extensively in program code,
memory addresses, and memory contents.


Compared to octal numerals, hexadecimal numerals
compress binary numbers further. Each hexadecimal
digit corresponds to 4 binary bits. Therefore, a
24-bit (36-bit) word requires only 6 (9) hexadecimal
digits. The only drawback is that some hexadecimal
digits look odd on a seven-segment display, as shown
in the figure below.

[missing resource: draw_odg1.png] 


Numeral Conversions 


Given a number represented in a numeral system, it
is straightforward to convert it to a decimal number:
simply follow the equation

$$a_{n-1}a_{n-2}\ldots a_1a_0.b_1b_2\ldots b_m = \sum_{i=0}^{n-1} a_i \times r^{i} + \sum_{j=1}^{m} b_j \times r^{-j},$$

where the $a_i$’s are the integral digits, the
$b_j$’s are the fraction digits, and $r$ is the radix.
The reverse conversions are harder. Consider the
conversion of a decimal integer to binary:

$$d = b_{n-1} \times 2^{n-1} + \ldots + b_1 \times 2^1 + b_0 \times 2^0,$$

where $d$ is a decimal integer and the $b_i$’s are
binary digits. Our goal is to find all the binary
digits $b_i$. A closer look at the equation shows
that the last term on the right-hand side is simply
$b_0$, which is either 0 or 1. If we divide both
sides by 2, the following equation is derived:


$$d/2 = (b_{n-1} \times 2^{n-1} + \ldots + b_1 \times 2^1 + b_0 \times 2^0)/2$$


Let $q$ be the quotient and $d_0$ be the remainder of
$d/2$, so that $d = 2q + d_0$. On the right-hand
side, all terms except the last one are divisible by
2. We have the following:


$$2q + d_0 = 2(b_{n-1} \times 2^{n-2} + \ldots + b_1 \times 2^0) + b_0$$


Based on this relation, $b_0 = d_0$, i.e., the
remainder of $d/2$. Since $b_0 = d_0$, we can
subtract it from both sides of the above equation.
The following relation is obtained.


$$
\begin{aligned}
2q + d_0 - d_0 &= 2(b_{n-1} \times 2^{n-2} + \ldots + b_1 \times 2^0) + b_0 - b_0 \\
2q &= 2(b_{n-1} \times 2^{n-2} + \ldots + b_1 \times 2^0) \\
q &= b_{n-1} \times 2^{n-2} + \ldots + b_1 \times 2^0
\end{aligned}
$$


The equation shows the relation between $q$ and its
binary representation, which is similar to the
original equation. Again, all we need to do is find
the remainder of $q/2$, and let $b_1$ be that
remainder. The rest of the $b_i$’s are obtained by
repeating this step. In general, a conversion to any
base can be computed in this way, dividing by the
base instead of 2. For example, converting 57 to
binary involves the steps shown in the table below.
Here we write the long division upside down and
record only quotients and remainders, with the
remainders on the right-hand side. Continue until
the quotient is less than the base, 2 in this
example; that final quotient is the leading digit.
The result is read from the bottom up. In this case,
it is 111001, which is the binary representation of
57.
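The repeated-division procedure can be sketched for any base from 2 to 16. The function `to_base` is a hypothetical name; unlike the hand method, this sketch divides until the quotient reaches zero, which yields the same digits.

```python
def to_base(number: int, base: int) -> str:
    """Convert a non-negative integer to the given base (2 to 16) by repeated division."""
    digits = "0123456789ABCDEF"
    if number == 0:
        return "0"
    remainders = []
    while number > 0:
        number, remainder = divmod(number, base)
        remainders.append(digits[remainder])   # least significant digit first
    return "".join(reversed(remainders))       # read the remainders bottom-up


print(to_base(57, 2))
print(to_base(1234, 16))
print(to_base(1234, 8))
```

The same function covers the decimal to hexadecimal and decimal to octal conversions; only the divisor changes.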


Table Converting 57 to Binary 

In principle, we could convert between numeral
systems in any order. In practice, however,
converting a decimal number to a larger base takes
fewer divisions than converting it to a smaller
base. The diagram therefore shows the suggested
order of numeral conversions. It is possible to
convert a decimal number directly to binary, and
vice versa. However, this requires more steps than
going through hexadecimal or octal numerals, and it
is error-prone because more steps complicate the
calculation. A direct link between hexadecimal and
octal is also missing from the diagram because
converting directly is much harder than going
through binary or decimal. In practice, it is much
easier to perform hexadecimal to/from octal
conversions via binary.


Decimal to Hexadecimal Integers 


The conversion from decimal to hexadecimal
involves divisions by 16, the base of hexadecimal
numerals. As with binary conversions, all we need
to do is find the remainders of the divisions. For
example, 1234 is converted to hexadecimal as shown
in the table below. The quotient 77 is obtained by
performing the division 1234/16 on the side (not
shown in the table). Readers should work this out on
scratch paper to avoid errors. Converting from
decimal to hexadecimal involves far fewer divisions
than converting to binary; this example takes only
2 divisions. The result is again read from the
bottom up. Thus, 1234 in hexadecimal is 4D2. The
result can be verified by
$4 \times 16^2 + 13 \times 16^1 + 2 = 1234$.


Table Converting 1234 to Hexadecimal 
Hexadecimal to Decimal


Based on the definition, hexadecimal to decimal
conversion is straightforward. We only need to
compute the sum of products. For example, $102_{16}$
is converted to decimal as follows:


$$1 \times 16^2 + 0 \times 16^1 + 2 \times 16^0 = 256 + 0 + 2 = 258$$


After several conversions, you will remember the
powers of 16. It pays to remember these powers, as
they help you convert numbers fast. Here are the
first several powers of 16 worth remembering.


$16^1 = 16$, $16^2 = 256$, $16^3 = 4096$, $16^4 = 65536$


These numbers are used frequently in computer
science, so they are worth memorizing.


Hexadecimal to Binary 


To convert from hexadecimal to binary, all we need
to do is convert each individual hexadecimal digit
to binary, and we are done. Why? Here is the
explanation. We first look at the definition of a
hexadecimal number.


$$h_{n-1}h_{n-2}\ldots h_1h_0 = h_{n-1} \times 16^{n-1} + \ldots + h_1 \times 16^1 + h_0,$$

where the $h_i$’s are hexadecimal digits.


If we convert each hexadecimal digit to binary, the
following equation is derived:


$$
\begin{aligned}
h_{n-1}h_{n-2}\ldots h_1h_0 = {} & (b_{4(n-1)+3} \times 2^3 + b_{4(n-1)+2} \times 2^2 + b_{4(n-1)+1} \times 2^1 + b_{4(n-1)} \times 2^0) \times 16^{n-1} + \ldots \\
& + (b_7 \times 2^3 + b_6 \times 2^2 + b_5 \times 2^1 + b_4 \times 2^0) \times 16^1 \\
& + (b_3 \times 2^3 + b_2 \times 2^2 + b_1 \times 2^1 + b_0 \times 2^0) \times 16^0,
\end{aligned}
$$

where $h_i = b_{4i+3} \times 2^3 + b_{4i+2} \times 2^2 + b_{4i+1} \times 2^1 + b_{4i} \times 2^0$.


The above equation can be expanded to a binary
representation if we replace 16 by $2^4$, as follows:


$$h_{n-1}h_{n-2}\ldots h_1h_0 = \sum_{i=0}^{4(n-1)+3} b_i \times 2^i$$


Note that there are 4n terms in total, since an
n-digit hexadecimal number is converted to a
4n-digit binary number, including leading zeros. For
example, the hexadecimal A1F2 is converted to
1010 0001 1111 0010, as illustrated in the figure
below.


A    1    F    2


1010 0001 1111 0010


Figure An Example of Converting Hexadecimal 
A1F2 to Binary 
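The digit-by-digit expansion can be sketched in one line of Python. `hex_to_binary` is a hypothetical helper; spaces are inserted between the 4-bit groups only for readability.

```python
def hex_to_binary(hex_text: str) -> str:
    """Expand each hexadecimal digit into its 4-bit group."""
    return " ".join(format(int(digit, 16), "04b") for digit in hex_text)


print(hex_to_binary("A1F2"))
```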


Binary to Hexadecimal 


The derivation for converting hexadecimal to binary
in the previous section is reversible. In other words,
each step in the derivation is an “if-and-only-if”
step. The details are left to the reader. So, to
convert from binary to hexadecimal, we group the
binary digits into groups of 4, starting from the
rightmost digit, and convert each group to one
hexadecimal digit. The figure below depicts an
example that converts 1000 0010 1100 0011 to
hexadecimal, giving 82C3.


Figure An Example of Converting a Binary 
Number to Hexadecimal 
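The grouping step can be sketched as follows. `binary_to_hex` is a hypothetical helper; it pads the input on the left with zeros so that the length is a multiple of 4, which corresponds to grouping from the rightmost digit.

```python
def binary_to_hex(bit_string: str) -> str:
    """Group bits by four from the right and map each group to a hex digit."""
    bits = bit_string.replace(" ", "")
    bits = bits.zfill(-(-len(bits) // 4) * 4)   # pad on the left to a multiple of 4
    groups = [bits[i:i + 4] for i in range(0, len(bits), 4)]
    return "".join(format(int(group, 2), "X") for group in groups)


print(binary_to_hex("1000 0010 1100 0011"))
print(binary_to_hex("10 1100 1010"))
```

The second call shows the route recommended for large numbers: the 10-bit input becomes a three-digit hexadecimal number, which is then easy to convert to decimal with the powers of 16.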


Decimal to Octal 


Decimal to octal conversion is very similar to
decimal to hexadecimal conversion. The only
difference is that the divisor is 8 instead of 16.
Let’s now convert 1234 to octal, as shown in the
table below. The result is 2322 and can be verified
by $2 \times 8^3 + 3 \times 8^2 + 2 \times 8^1 + 2 \times 8^0 = 1234$.
Again, you may want to remember the powers of 8
to speed up the computations. Here is a list:


$8^2 = 64$, $8^3 = 512$, $8^4 = 4096$, $8^5 = 32768$


Table An Example of Converting 1234 to Octal 
Binary to Decimal Integers


Converting a binary number to decimal follows the
definition,

$$a_{n-1}a_{n-2}\ldots a_1a_0 = \sum_{i=0}^{n-1} a_i \times 2^i = a_0 \times 2^0 + a_1 \times 2^1 + \ldots + a_{n-1} \times 2^{n-1},$$

where the $a_i$’s are binary digits. For example,
the binary number 1011001010 is converted as
follows:


Table An Example of Converting 10 1100 1010 to 
Decimal 


[missing_resource: tb3] 


Converting binary numbers to decimal involves
calculating powers of 2, multiplications, and an
addition, as illustrated in the table above. For
small numbers, such as 4-bit binary numbers, it is
fine to do it this way. To convert a small binary
number quickly with this approach, you can memorize
the values of the small powers of 2, e.g., $2^0$ to
$2^8$ as shown in the table, and add together those
whose corresponding binary bits are one. For
example, 1010 is equal to 8 + 2 = 10. The 4 and 1
are not added because their corresponding bits are
zero.


However, to convert a large binary number, we
would have to calculate as many powers of 2 as there
are binary digits. In a typical computer system, it
is normal to have 32-bit or 64-bit registers to
store numbers, so direct conversion is error-prone.
In practice, convert the binary number to
hexadecimal first, and then convert the hexadecimal
number to decimal. Binary to hexadecimal conversion
is fast and involves 2 steps:


1. Group every 4 bits from the right, and
2. Convert each group of 4 bits to hexadecimal.


Readers are referred to a previous section for
more detail.


Converting the resulting hexadecimal number to
decimal involves far fewer computations and is much
faster than direct conversion. For the example shown
above, the conversion via hexadecimal,
$10\,1100\,1010_2 = 2CA_{16} = 714_{10}$, is
illustrated in the table below.


Table An Example of Converting a Binary Number 
to Decimal by First Converting to Hexadecimal 
Real Numbers


A decimal real number is composed of an integral
part and a fraction part. Conversion of a decimal
number to or from other representations can be
performed on the integral part and the fraction part
individually. The integral part is converted using
the techniques described in other sections. In this
section, we focus on the conversion of the fraction
part.


According to the definition,

$$a_{n-1}a_{n-2}\ldots a_1a_0.b_1b_2\ldots b_m = \sum_{i=0}^{n-1} a_i \times r^{i} + \sum_{j=1}^{m} b_j \times r^{-j},$$

where the $a_i$’s are the r-nary digits of the
integral part, the $b_j$’s are the r-nary digits of
the fraction part, and $r$ is the base, converting a
number in another representation to decimal amounts
to computing the sum. For example, the conversion of
the number $1.234_8$ is depicted in the table below.
Note that a period is used to separate the integral
part from the fraction part.


Table An Example of Converting an Octal Real
Number to Decimal


Powers   8^0   .   8^-1    8^-2       8^-3
Value    1     .   0.125   0.015625   0.001953125
Octal    1     .   2       3          4

When performing the conversion manually, it may be
helpful to use fraction notation with a numerator
and a denominator, n/d, where n denotes the
numerator and d the denominator. For the above
example, the steps of a manual conversion are
illustrated in the table below.


Table An Example of Manually Converting an Octal
Real Number to Decimal


Powers          8^0   .   8^-1   8^-2   8^-3
Value           1     .   1/8    1/64   1/512
Octal           1     .   2      3      4
Digit x value   1     .   2/8    3/64   4/512

Sum = 1 + 2/8 + 3/64 + 4/512
    = 1 + 128/512 + 24/512 + 4/512
    = 1 + 156/512
    = 1.3046875
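The manual method with numerators and denominators maps naturally onto exact rational arithmetic. The sketch below uses the standard `fractions` module; `real_to_fraction` is a hypothetical helper that evaluates the defining sum for any base up to 36.

```python
from fractions import Fraction


def real_to_fraction(text: str, base: int) -> Fraction:
    """Evaluate a numeral with a radix point, e.g. '1.234' in base 8, exactly."""
    whole, _, frac = text.partition(".")
    value = Fraction(int(whole, base)) if whole else Fraction(0)
    for position, digit in enumerate(frac, start=1):
        value += Fraction(int(digit, base), base ** position)   # digit / base**position
    return value


octal_value = real_to_fraction("1.234", 8)
print(octal_value, float(octal_value))
```

The same helper handles binary reals, so it can also be used to check a sum such as 11.01 + 1.1 against 100.11.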


Now, let’s turn to the conversion from decimal
numbers to r-nary numbers. Remember that
converting the integral part of a decimal number
involves divisions. The fraction part is converted
using multiplications. Consider an r-nary numeral
system and a decimal fraction $f$ with $f < 1$. We
have the following:


$$f = b_1 \times r^{-1} + b_2 \times r^{-2} + \ldots + b_{n-1} \times r^{-(n-1)}.$$


Our goal is to find all the $b_j$’s. If both sides of
the above equation are multiplied by $r$, this yields


$$f \times r = b_1 + b_2 \times r^{-1} + \ldots + b_{n-1} \times r^{-(n-2)}.$$


To satisfy the equality, we pick $b_1$ to be the
integral part of $f \times r$; the fraction part of
$f \times r$ is then the rest of the terms on the
right-hand side, i.e.,
$b_2 \times r^{-1} + \ldots + b_{n-1} \times r^{-(n-2)}$.
Following this procedure, we can get
$b_2, b_3, \ldots, b_{n-1}$. For example, converting
the decimal 0.3046875 to octal is illustrated in the
table below. In the first row, the multiplication
result is 2.4375, which is composed of the integer 2
and the fraction 0.4375. The integer 2 is the first
octal fraction digit, and the fraction is used to
find the second octal fraction digit. The result
confirms that $0.3046875_{10} = 0.234_8$, in
agreement with the previous example.
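The repeated-multiplication procedure can be sketched with exact fractions, which avoids floating-point rounding. `fraction_to_base` and its `max_digits` cutoff are assumptions made for this illustration; the cutoff is needed because some fractions never terminate in a given base.

```python
from fractions import Fraction


def fraction_to_base(fraction: Fraction, base: int, max_digits: int = 12) -> str:
    """Convert 0 <= fraction < 1 to the given base (2 to 16) by repeated multiplication."""
    digits = "0123456789ABCDEF"
    out = []
    while fraction != 0 and len(out) < max_digits:
        fraction *= base
        whole = int(fraction)       # the integral part is the next digit
        out.append(digits[whole])
        fraction -= whole           # keep only the fraction part for the next round
    return "0." + ("".join(out) or "0")


print(fraction_to_base(Fraction("0.3046875"), 8))
print(fraction_to_base(Fraction(1, 10), 2))   # does not terminate; cut off at max_digits
```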


Table An Example of Converting a Decimal Fraction 

Converting decimal fractions to binary or
hexadecimal is similar to the example shown here.
For binary conversion, the multiplier is 2; for
hexadecimal conversion, the multiplier is 16.


Binary to Octal 


Like binary to hexadecimal, binary to octal
conversion can be done quickly by a “group-by-three”
method. The rationale is similar to that behind the
binary to hexadecimal conversion and is reiterated
here.


$$o_{n-1}o_{n-2}\ldots o_1o_0 = o_{n-1} \times 8^{n-1} + \ldots + o_1 \times 8^1 + o_0,$$

where the $o_i$’s are octal digits.


If we convert each octal digit to binary, the
following equation holds:


$$
\begin{aligned}
o_{n-1}o_{n-2}\ldots o_1o_0 = {} & (b_{3(n-1)+2} \times 2^2 + b_{3(n-1)+1} \times 2^1 + b_{3(n-1)} \times 2^0) \times 8^{n-1} + \ldots \\
& + (b_5 \times 2^2 + b_4 \times 2^1 + b_3 \times 2^0) \times 8^1 \\
& + (b_2 \times 2^2 + b_1 \times 2^1 + b_0 \times 2^0) \times 8^0,
\end{aligned}
$$

where $o_i = b_{3i+2} \times 2^2 + b_{3i+1} \times 2^1 + b_{3i} \times 2^0$.


The above equation can be expanded to a binary
representation if we replace 8 by $2^3$, as follows:


$$o_{n-1}o_{n-2}\ldots o_1o_0 = \sum_{i=0}^{3(n-1)+2} b_i \times 2^i$$


Note that there are 3n terms in total, since an
n-digit octal number is converted to a 3n-digit
binary number, including leading zeros, and each
step in the equation is reversible. Therefore, the
“group-by-three” approach can be applied to
conversions in both directions. For example, given
the binary 010 011 111 100 (a space is used to
separate the groups of 3), the figure below
illustrates how to convert it to the equivalent
octal number. The binary 010 011 111 100 is equal to
$2374_8$.


Figure An Example of Converting Binary to Octal 
Using the “Group-of-Three” Approach 


Octal to Decimal 


Following the definition, an octal number is
converted to decimal by calculating the summation.
For example, $2322_8 = 1234_{10}$. The detailed
conversion is illustrated in the table below.


Table An Example of Converting Octal to Decimal 
Octal to Binary


Octal to binary is the reverse of the binary to
octal method shown in the previous section. All we
need to do is map each octal digit to its binary
representation. It is recommended to memorize the
mapping for fast translation in practice. The table
below lists the mapping from octal digit to binary.


Table A Mapping from Octal Digit to Binary 

An example of converting an octal number to binary
is depicted in the figure below. The conversion is
done by simply mapping each octal digit to its
binary representation according to the mapping
table. The octal number
$3456_8 = 011\,100\,101\,110_2$.


Figure An Example of Converting an Octal 
Number to Binary 
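Both directions of the “group-by-three” approach can be sketched in a few lines. The helper names `binary_to_octal` and `octal_to_binary` are hypothetical; spaces in the binary text are treated purely as visual separators.

```python
def binary_to_octal(bit_string: str) -> str:
    """Group bits by three from the right and map each group to an octal digit."""
    bits = bit_string.replace(" ", "")
    bits = bits.zfill(-(-len(bits) // 3) * 3)   # pad on the left to a multiple of 3
    return "".join(str(int(bits[i:i + 3], 2)) for i in range(0, len(bits), 3))


def octal_to_binary(octal_text: str) -> str:
    """Map each octal digit to its 3-bit group."""
    return " ".join(format(int(digit, 8), "03b") for digit in octal_text)


print(binary_to_octal("010 011 111 100"))
print(octal_to_binary("3456"))
```

Chaining such digit-wise mappings through binary is also the practical route between octal and hexadecimal.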


Remarks 


Numeral conversion involves radix division for the 
integral part and radix multiplication for the 
fraction part of a number. The larger the radix, the 
fewer digits are needed to represent a number. 
Hexadecimal is typically used to represent memory 
addresses and contents in operating systems for its 
succinct notation. Where the radix is clear from the 
context, it is omitted from the subscript of a number. 
The radix can be any integer greater than 1, but 
in computer science the frequently used ones are base-2 
(binary), base-8 (octal), base-10 (decimal), and 
base-16 (hexadecimal). Numbers in a radix n are 
also called n-ary numerals or base-n numbers. The 
bottom line is that different representations of a 
number have the same mathematical value, which is 
equal to the number per se. 
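
As a minimal sketch of the two procedures named above, repeated division handles the integral part and repeated multiplication handles the fraction part. The function names are illustrative only, and the fraction is passed as a numerator and a denominator (an assumption made here so that the arithmetic stays exact).

```python
DIGITS = "0123456789ABCDEF"

def int_to_radix(n, radix):
    """Integral part: divide by the radix repeatedly and collect the remainders."""
    if n == 0:
        return "0"
    out = []
    while n > 0:
        n, r = divmod(n, radix)
        out.append(DIGITS[r])
    return "".join(reversed(out))  # remainders come out least significant first

def frac_to_radix(num, den, radix, max_digits=12):
    """Fraction part num/den in [0, 1): multiply by the radix and peel off the integer digit."""
    out = []
    while num and len(out) < max_digits:
        num *= radix
        digit, num = divmod(num, den)
        out.append(DIGITS[digit])
    return "".join(out) or "0"

print(int_to_radix(1234, 8))     # 2322
print(frac_to_radix(3, 4, 2))    # 11, i.e. 0.75 = 0.11 in binary
```

The max_digits limit is there because many fractions (for example 1/10 in binary) never terminate.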


Exercise 


1. Convert the following unsigned binary 
numerals to decimal: 


11111 00000 10101 10001 101 001 


1. Convert the following decimal numbers to 
binary: 


1111 0000 135 123 1024 16 

1. Convert the following octal numbers to binary: 
Loo LIL 73 777-0 1 

1. Convert the following binary numbers to octal: 


101 010 111 001 111 111 111 111 011 011 011 
01101101111 001 


1. Convert the following decimal numbers to 
octal: 


1024 512 99 111 127 


1. Convert the following octal numbers to 
decimal: 


135 333 777 111 101 2345 


1. Convert the following binary numbers to 
hexadecimal: 


1000 1000 1000 1000 1010 0010 0100 1111 1 
0001 1001 1111 


1. Convert the following hexadecimal numbers to 
binary: 


abcdef 3 A 2 B FACE BAD AAD FADE 2 


1. Convert the following decimal numbers to 
hexadecimal: 


1023 65535 4321 1111 13579 


1. Convert the following hexadecimal numbers to 
decimal: 


abcdef 3 A 2 B FACE BAD AAD FADE 2 1234 11111 


1. Which of the following binary numbers are 
even? How can you tell if a binary number is 
even or odd? Note that a number divisible by 2 
is even. Otherwise, it is odd. 


1010010101 1111111000 
101010101010101010101 
10000000000000000000000001 
11111111111111111111111111111 
LEDLELI LIAL AAT ALIA 10 


1. Convert $1023_{10}$ to binary and negate it using 
two’s complement. Compute -1023 + 1023 in 
binary. What result do you expect and why? 

2. Convert the decimal fraction 15/16 to binary. 
Use a “binary period” to separate the integral 
part and the fraction part. 

3. Describe how to perform subtraction using 
addition. Argue why the result is correct. 

4. Which of the following hexadecimal numbers 


are even? Note that a number divisible by 2 is 
even. Otherwise, it is odd. 


abcdef FFFFFFFFFFFFE AAAAAAAAAAAAA 13579 
2468 CCDDEE ABABABABAB BABABABABA 


1. In what numeral representations are there two 
notations for zero? 

2. Given n bits, how many signed numbers can be 
represented using the sign-and-magnitude 
method, the ones’ complement method, and the 
two’s complement method? 

3. In two’s complement method, why is there one 
more negative number than there are positive 
numbers? 


Floating Points 

This chapter explains how data is represented in a 
machine. It compares representations of integers with 
floating point numbers and describes underflow, 
overflow, round-off, and truncation errors in data 
representations. The IEEE 754 standard for floating 
point numbers, including NaN and denormal numbers, is 
introduced, along with floating-point arithmetic and 
floating-point hardware design. 


Floating Point Numbers 


Computer programs such as scientific applications 
often involve very large numbers, e.g., the 
number of atoms in a mole in Chemistry, or very 
small numbers close to zero, e.g., the charge of an 
electron in Electronics. Very large numbers 
may not be representable by a 32-bit unsigned 
integer, which represents numbers up to $2^{32} - 1$ 
(about 10 decimal digits). Even a 64-bit unsigned 
integer can only represent numbers up to about 20 
decimal digits. On the other hand, small numbers with a 
large number of fractional digits (those to the right 
of the decimal point) may not fit into the fixed 
number of digits of an integer representation. 


The need for representing big and small real numbers 


It is not uncommon for computers to handle very 
big or very small numbers. For example, in 
Chemistry, the number of atoms in a mole is 
approximately $6.022 \times 10^{23}$. In Electronics, the 
magnitude of the charge of an electron is about 
$1.602 \times 10^{-19}$ coulombs. Obviously, these numbers 
cannot be represented in integer format. In order for 
computers to perform computations over these 
numbers, another approach to storing them 
in memory is needed. 


Scientific Notation 


Scientists encountered the representation 
problem for very large and very small numbers long ago. 
Therefore, a succinct notation was developed to 
denote these numbers. If we write out the number 
$6.022 \times 10^{23}$, it will be 
602,200,000,000,000,000,000,000, which has 24 decimal 
digits. The representation $6.022 \times 10^{23}$ is a 
scientific notation, which has the following format: 


$m \times 10^{e}$ ($m$ times 10 to the power of $e$) where 
$m$ is a real number called the significand or mantissa, 
and the exponent $e$ is an integer. A negative number 
will have a negative mantissa, whereas a negative exponent 
indicates a number close to zero. For example, , , 
and . It can be seen that digits to the left are more 
significant than others. For example, is not a big 
difference compared to but the difference between 


and is 10 times more than that of the former two 
numbers. The same scenario applies to the negative 
exponent case. 


Though the scientific notation is terse, its 
arithmetic has to be carried out carefully to 
produce correct results. Addition or subtraction of 
two numbers in scientific notation requires the 
same exponent. If the exponents are not the same, 
one of the two numbers has to be adjusted so that their 
exponents are identical. Given two numbers 
$a = m_a \times 10^{e}$ and $b = m_b \times 10^{e}$ in 
scientific notation, the following shows addition 
and subtraction of the two. 

$a + b = (m_a + m_b) \times 10^{e}$ 
$a - b = (m_a - m_b) \times 10^{e}$ 


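Multiplication and division of two numbers in 
scientific notation do not require the same 
exponent. They operate directly on the 
mantissas and exponents. Given two numbers 
$a = m_a \times 10^{e_a}$ and $b = m_b \times 10^{e_b}$ 
in scientific notation, the following shows 
multiplication and division of the two. 

$a \times b = (m_a \times m_b) \times 10^{e_a + e_b}$ 
$a \div b = (m_a \div m_b) \times 10^{e_a - e_b}$ 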


For multiplication, the mantissa of the product is 
the product of the mantissas of the multiplicand and 
the multiplier; the exponent of the product is the 
sum of their exponents. 
As to division, the mantissa of the quotient is equal 
to the mantissa of the dividend divided by the 
mantissa of the divisor; the exponent of the quotient 
is equal to the exponent of the dividend minus the 
exponent of the divisor. 


Normalization 


It is obvious that any number can be written in 
many scientific representations. For example, may 
be written as , or , or many others. In normalized 
scientific notation, the absolute value of the 
mantissa is limited to be at least 1 and less than 10 
such that 


$x = m \times 10^{e}$ where $1 \le |m| < 10$. 


Therefore, the above number in normalized 
scientific notation is not or . Normalization provides 
an easier way of comparing two numbers of the 
same sign as the exponent indicates the order of 
magnitude. In normalized scientific notation, if the 
exponent is negative, the absolute value of the 
number is between 0 and 1 as the following 
inequalities suggest. 


$e \le -1$ and $1 \le |m| < 10$ imply $0 < |m| \times 10^{e} < 10 \times 10^{-1} = 1$ 


For example, is and its absolute value in between 0 
and 1. Note that 0 can be written as $0 \times 10^{e}$ 
where $e$ can be any arbitrary integer, such as 
$0 \times 10^{0}$, $0 \times 10^{1}$, etc. However, none of 
these is normalized, as the mantissa has to be 0, 
which violates the normalization definition that 
$1 \le |m| < 10$. 
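
The rules for aligning exponents, operating on mantissas, and normalizing the result can be sketched as follows. This is an illustrative sketch, not part of the original text: a number is modeled as a hypothetical (mantissa, exponent) pair, and Python's decimal module is used so that decimal mantissas stay exact.

```python
from decimal import Decimal

def normalize(m, e):
    """Return (m, e) with 1 <= |m| < 10. Zero has no normalized form and is returned unchanged."""
    if m == 0:
        return m, e
    while abs(m) >= 10:
        m, e = m / 10, e + 1
    while abs(m) < 1:
        m, e = m * 10, e - 1
    return m, e

def add(a, b):
    """Align the smaller exponent to the larger one, then add the mantissas."""
    (ma, ea), (mb, eb) = a, b
    if ea < eb:
        (ma, ea), (mb, eb) = (mb, eb), (ma, ea)
    mb = mb * Decimal(10) ** (eb - ea)   # shift mb so both numbers share exponent ea
    return normalize(ma + mb, ea)

def multiply(a, b):
    """Multiply the mantissas and add the exponents."""
    return normalize(a[0] * b[0], a[1] + b[1])

m, e = add((Decimal("9.5"), 3), (Decimal("7.5"), 2))
print(f"{m} x 10^{e}")   # 1.025 x 10^4
```

Subtraction and division follow the same pattern with the mantissas subtracted or divided and, for division, the exponents subtracted.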


IEEE Standards for Floating Point 
Numbers 


Based on the scientific notation, numbers stored in 
computers can employ the same format. In early 
computer designs, the floating point format varied 
from manufacturer to manufacturer, which caused 
incompatibility when data were exchanged. 
The Institute of Electrical and Electronics Engineers 
(IEEE) published the IEEE standard for floating-point 
arithmetic (IEEE 754) in 1985 and revised it in 2008. 
IEEE 754 has been used to design the CPU and FPU in 
almost all computers. Many software systems 
implement some or all arithmetic using IEEE 
754. The standard defines arithmetic formats, 
interchange formats, rounding algorithms, 
operations, and exception handling. In the 
arithmetic formats, IEEE 754 defines floating 
point formats, which consist of finite numbers, 
infinities, and “not a number” values (NaNs). 


Binary32 Single Precision Format 


There are five basic formats defined in the IEEE 754 
standard. We will discuss two of them that are 
frequently used in computer systems, namely the 
binary32 (single precision) format and the binary64 
(double precision) format. The binary32 format uses 
32 binary bits to represent a floating point number 
whereas the binary64 format uses 64 bits. 


Since scientific representation for numbers is not 
unique, the IEEE 754 standard defines the 
normalized binary numbers. Like the scientific 
notation, the normalized binary number has the 


following form: 


$x = \pm m \times 2^{e}$ 

where $1 \le m < 2$. Thus, the value of $m$ is something 
like $1.bbb \ldots b$ where each $b$ could be either 0 or 1. 
Table 1 illustrates the 
IEEE 754 standard for the binary32 single precision 
format. The single precision format defines 1 bit 
(MSB) for the sign, followed by 8 bits for the exponent, 
and followed by 23 bits for the mantissa or fraction, 
for a total of 32 bits. The float keyword is used to 
specify a single precision number in languages like 
Java or C. 


Table 1 The Binary32 Single Precision Floating 
Point Format in the IEEE 754 Standard 
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A sign bit 1 indicates the number is negative 
whereas a sign bit 0 denotes a positive number. 
Since the significand is normalized, it always 
starts with a 1, so the leading one does not need to 
take up space in the representation. Thus, there is an 
implicit one in front of the fraction when restoring the 
number. The 8 bits in the 
exponent field may represent unsigned values from 
0 to 255. However, in order to represent negative 
exponents, a bias is defined as 127. So a number 
with an exponent field of 1 in the IEEE 754 single 
precision format will have an actual exponent of 
$1 - 127 = -126$. Exponents 
with all zero’s and all one’s are reserved for special 
purposes. Therefore, the actual exponents range from 
$-126$ to $127$. 


For example, the decimal $-0.75$ can be written in 
binary as $-0.11_2$. In the normalized form 
$-1.1_2 \times 2^{-1}$, the sign bit is 1, the fraction 
is $.1_2$, and the exponent is $-1$. 
Because of the bias, we have to add 127 to the 
exponent, i.e., the exponent stored in the format 
should be 126. Table 2 shows the IEEE 754 single 
precision representation for the decimal number $-0.75$. 


Table 2 The IEEE 754 Single Precision 
Representation for -0.75 


         Sign   Exponent    Fraction 
bits     1      8           23 
values   1      0111 1110   100 0000 0000 0000 0000 0000 
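
The three fields of Table 2 can be checked with a short Python sketch. It uses the standard struct module to obtain the binary32 bit pattern; the helper name float32_fields is hypothetical.

```python
import struct

def float32_fields(x):
    """Return (sign, biased exponent, fraction) of x encoded as IEEE 754 binary32."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

sign, exponent, fraction = float32_fields(-0.75)
print(sign, format(exponent, "08b"), format(fraction, "023b"))
# 1 01111110 10000000000000000000000
```

The stored exponent 0111 1110 is 126, i.e. the actual exponent -1 plus the bias 127.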


Single Precision Range 


With only 32 bits and fixed precision 
arithmetic, there is a limited range of numbers that can 
be represented by the IEEE 754 single precision format. 
Let’s first consider the smallest number, in magnitude, 
that the normalized format can represent. In order to 
make the magnitude of a represented number as small as 
possible, the fraction should be all zero’s and the 
exponent should be the smallest possible negative number. 
Table 3 shows this number with a negative sign. The 
smallest exponent is $-126$ as mentioned 
earlier. The number shown in Table 3 is 
$-1.0 \times 2^{-126}$, which is about $-1.18 \times 10^{-38}$. 


Table 3 The Smallest Number Represented by the 
IEEE 754 Single Precision Format 


         Sign   Exponent    Fraction 
values   1      0000 0001   000 0000 0000 0000 0000 0000 
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On the other hand, the largest number that the IEEE 
754 single precision format represents will have the 
following constraints: 


1. The sign is positive, i.e., the sign bit is 0, 
2. The exponent is maximum, i.e., 127, and 
3. The fraction bits are all one’s. 


Table 4 illustrates the maximal number represented 
by the IEEE 754 single precision standard. The 
number represented is $(2 - 2^{-23}) \times 2^{127}$, which 
is about $3.4 \times 10^{38}$. 


Table 4 The Maximal Number Represented by the 
IEEE 754 Single Precision Format 


         Sign   Exponent    Fraction 
values   0      1111 1110   111 1111 1111 1111 1111 1111 
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Thus, the range of single precision numbers in the 
IEEE 754 standard is, in magnitude, from about 
$1.18 \times 10^{-38}$ to about $3.4 \times 10^{38}$. Numbers 
outside the single precision range cannot be represented 
as normalized numbers in the IEEE 754 single precision 
standard. It is worth mentioning that numbers needing 
more than 23 fraction bits cannot be represented exactly 
in this single precision format. Instead, a higher 
precision format such as the IEEE 754 double precision 
format is needed. 
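
As a quick check of the two extremes, the bit patterns described above can be decoded in Python. This sketch assumes the patterns 0x00800000 (exponent field 0000 0001, fraction all zeros) and 0x7F7FFFFF (exponent field 1111 1110, fraction all ones), both with a positive sign; the helper name bits_to_float32 is hypothetical.

```python
import struct

def bits_to_float32(bits):
    """Interpret a 32-bit integer as an IEEE 754 binary32 value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

smallest_normal = bits_to_float32(0x00800000)
largest = bits_to_float32(0x7F7FFFFF)

print(smallest_normal == 2.0 ** -126)                 # True
print(largest == (2 - 2.0 ** -23) * 2.0 ** 127)       # True
```

Both comparisons are exact because every binary32 value is also exactly representable as a Python float (binary64).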


Binary64 Double Precision Format 


When numbers outside the single precision range 
are needed for computation, more space is required 
to store them in computer systems. The IEEE 754 
standard defines a binary64 double precision format 
that requires double the size of that used in the 
single precision format. The 64 bits are allocated for 
the sign bit, the exponent, and the fraction as 
depicted in Table 5. 


Table 5 The Binary64 Double Precision Format in 
The IEEE 754 Standard 
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where 


Like in its single precision version, the 11-bit 
exponent in the IEEE 754 double precision standard 
reserves all zero’s and all one’s for special purposes. 
Therefore, the smallest exponent field in the format is 1 
and the largest is 2046. With the bias 1023, the IEEE 
double precision standard can represent numbers with an 
actual exponent ranging from $-1022$ to $1023$. 


For example, the decimal $-12.625$ is translated to 
$-1100.101_2$ in binary. After normalization, it is 
$-1.100101_2 \times 2^{3}$. Therefore, the sign bit is 
one, the fraction is $100101_2$, and the exponent is 
$3 + 1023 = 1026$. Note that 
we add the bias to the exponent when a number is 
translated to the IEEE double precision format. 
Table 6 shows the IEEE double precision 
representation for the number $-12.625$. 


Table 6 The Number -12.625 Represented in the 
IEEE Double Precision Format 

         Sign   Exponent        Fraction 
bits     1      11              52 
values   1      100 0000 0010   1001 0100 0000 0000 0000 0000 0000 
                                0000 0000 0000 0000 0000 0000 
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It is convenient and readable to group the digits in 
fours, separated by a space, as there are lots of 
zeros. There are 13 groups of 4 in the fraction part. 
The fraction digits are filled from left to 
right. This way, some errors may be avoided. Also, 
when hexadecimal is needed, the groups are readily 
converted to hexadecimal. 
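
The double precision fields for -12.625 can be verified the same way. The helper name float64_fields is hypothetical; the fraction is printed in hexadecimal, which corresponds to the 13 groups of 4 bits.

```python
import struct

def float64_fields(x):
    """Return (sign, biased exponent, fraction) of x encoded as IEEE 754 binary64."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    return bits >> 63, (bits >> 52) & 0x7FF, bits & ((1 << 52) - 1)

sign, exponent, fraction = float64_fields(-12.625)
print(sign, exponent, hex(fraction))   # 1 1026 0x9400000000000
```

The leading hexadecimal digits 9 and 4 are the groups 1001 and 0100 of the fraction.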


Double Precision Range 


Even with 64 bits, there is a limit for the IEEE 754 
double precision format. The smallest number, in 
magnitude, that can be represented as a normalized 
number in this standard is as follows: 


1. The sign is negative, i.e., the sign bit is 1, 
2. The exponent is the smallest, i.e., $-1022$, and 
3. The fraction bits are all zero’s. 


Therefore, the smallest number is $-1.0 \times 2^{-1022}$, 
which is about $-2.2 \times 10^{-308}$. 
Table 7 illustrates the smallest number in the IEEE 
754 double precision standard. 


Table 7 The Smallest Number that May be 
Represented by the IEEE 754 Double Precision 
Standard 

         Sign   Exponent        Fraction 
values   1      000 0000 0001   0000 0000 ... 0000 (52 zero’s) 

The number represented is $-1.0 \times 2^{-1022}$. 


On the other hand, the largest number that may be 
represented by the IEEE 754 double precision 
standard is as follows: 


1. The sign is positive, i.e., the sign bit is 0, 
2. The exponent is the largest, i.e., $1023$, and 
3. The fraction bits are all one’s. 


The largest number is $(2 - 2^{-52}) \times 2^{1023}$, which 
is about $1.8 \times 10^{308}$, as shown in Table 8. 


Table 8 The Largest Number Represented by the 
IEEE 754 Double Precision Standard 

         Sign   Exponent        Fraction 
values   0      111 1111 1110   1111 1111 ... 1111 (52 one’s) 

The number represented is $(2 - 2^{-52}) \times 2^{1023}$. 


Therefore, the IEEE 754 double precision standard 
will represent normalized numbers whose magnitudes range 
from $2^{-1022}$ to $(2 - 2^{-52}) \times 2^{1023}$. 


Floating Point Precision 


The fraction bits in the IEEE 754 standard are the 
significant bits. Therefore, the single precision format 
will have 24 significant bits (23 stored fraction bits 
plus the implicit leading one), about 7 decimal digits of 
precision. The double precision format will have 
53 significant bits, about 16 decimal digits of precision. 
Note that a number may fall in the range of a format but 
not be accurately represented due to the 
precision limit. For example, the number 999,999,999 
needs 30 significant bits. When its significand is 
truncated to fit the IEEE 754 single precision format, 
the stored value is 999,999,936, so there 
is a precision error of 63 during the conversion: the 
last 6 one’s of the significand cannot be stored in the 
fraction field due to the precision limit. (With the 
default round-to-nearest mode, the stored value is 
1,000,000,000 instead, an error of 1.) The 
tolerance for precision errors depends on the 
application. In some applications, such as the control 
of a nuclear reaction, a failure may lead to disaster. 
Such critical applications have to be designed 
with precision errors in mind. 
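
The precision error can be reproduced with a small sketch. It assumes the number in the example is 999,999,999, which is consistent with the stored value 999,999,936 and the error of 63 quoted above. Truncation is done by hand on the integer; the rounded value comes from a round trip through binary32 with the struct module, which rounds to nearest.

```python
import struct

n = 999_999_999
drop = n.bit_length() - 24          # bits that do not fit into a 24-bit significand
truncated = (n >> drop) << drop     # discard the low-order bits
rounded = struct.unpack(">f", struct.pack(">f", float(n)))[0]

print(n.bit_length(), drop)         # 30 6
print(truncated, n - truncated)     # 999999936 63
print(rounded)                      # 1000000000.0
```

Neighbouring single precision values are 64 apart at this magnitude, so 999,999,999 lies between 999,999,936 and 1,000,000,000 and neither truncation nor rounding can store it exactly.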


Special Values 
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In Mathematics, a nonzero number divided by zero 
is infinite. A typical computer design will raise a 
“divide-by-zero” error flag and let the software system 
handle the exception. A feature of the IEEE 754 
standard allows representations for such unusual 
events. Recall that the exponent in both the single 
precision and double precision formats reserves all 
zero’s and all one’s. The exponent with all one’s is 
reserved to represent infinities and NaNs. Table 9 lists 
the special values defined in the IEEE 754 standard. 


Table 9 Special Values Defined in the IEEE 754 
Standard 


Sign     Exponent     Fraction   Object 
0 or 1   All zero’s   Zero       Zero (positive or negative) 
0        All one’s    Zero       Positive infinity 
1        All one’s    Zero       Negative infinity 
0 or 1   All one’s    Nonzero    NaN (Not a Number) 


Zero is considered a special value because it 
cannot be represented in normalized scientific 
notation. So when both the exponent and the fraction 
are all zero’s, the value is zero as shown in Table 9. 
However, the sign bit could be 0 or 1, 
which leads to a drawback in the floating point 
representation: there are both a positive zero and a 
negative zero. 


Invalid operations, such as $0/0$, the square root of a 
negative number, or $\infty - \infty$, may produce a result 
that is not a number (NaN). In the IEEE 754 standard, NaN 
is represented with all one’s in the exponent and 
a nonzero fraction. The sign bit is ignored in a 
NaN representation. 
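
The reserved exponent patterns can be turned into a small classifier. This is a sketch for the single precision layout only; the function name classify32 and the returned labels are hypothetical.

```python
def classify32(bits):
    """Classify a 32-bit pattern according to the IEEE 754 single precision special values."""
    sign = "-" if bits >> 31 else "+"
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if exponent == 0xFF:                       # all one's: infinity or NaN
        return "NaN" if fraction else sign + "infinity"
    if exponent == 0:                          # all zero's: zero or denormal
        return sign + ("zero" if fraction == 0 else "denormal")
    return sign + "normal"

for pattern in (0x00000000, 0x80000000, 0x7F800000, 0xFF800000, 0x7FC00000, 0xBF400000):
    print(format(pattern, "08X"), classify32(pattern))
```

The last pattern, BF400000, is the encoding of -0.75 and is reported as a normal number.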


Denormal Numbers 


Normalization limits the significand to be at 
least 1. Thus, the smallest absolute value of normalized 
numbers that can be represented in the IEEE single 
precision is $2^{-126}$. There is a gap between 0 and 
$2^{-126}$. Denormal numbers 
(denormalized numbers or subnormal numbers) fill 
the gap. In the IEEE 754 standard, denormal 
numbers are represented with all zero’s in the 
exponent but a nonzero fraction; the leading bit of the 
significand is then 0 and the exponent is fixed at 
$-126$. The smallest 
positive denormal number in the IEEE 754 single 
precision is illustrated in Table 10. The value of the 
number is $2^{-23} \times 2^{-126}$, which is equal to 
$2^{-149}$. The largest negative 
number (closest to zero) in this format is $-2^{-149}$. 


Table 10 The Smallest Positive Denormal Number in 
the IEEE Single Precision Representation 

         Sign   Exponent    Fraction 
values   0      0000 0000   000 0000 0000 0000 0000 0001 

The number represented is $2^{-149}$. 


For the double precision, the smallest positive 
denormal number is $2^{-52} \times 2^{-1022} = 2^{-1074}$, 
and the largest negative 
number (closest to zero) is $-2^{-1074}$. 
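
The smallest positive denormal numbers can be obtained by decoding the bit pattern whose only set bit is the last fraction bit. The sketch below relies on the standard struct and math modules; the variable names are illustrative.

```python
import math
import struct

# Exponent field all zero's, fraction = 00...01
smallest32 = struct.unpack(">f", struct.pack(">I", 0x00000001))[0]
smallest64 = struct.unpack(">d", struct.pack(">Q", 0x0000000000000001))[0]

print(smallest32 == math.ldexp(1.0, -149))    # True
print(smallest64 == math.ldexp(1.0, -1074))   # True
```

math.ldexp(1.0, k) computes 2 to the power k exactly, which avoids any rounding in the comparison.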


Exercise 


1. Compute the following floating point addition, 
normalize the result, and write the fraction as a 
finite precision number with six decimal digits 
of precision, i.e. six digits to the right of the 
decimal point. Explain why you got the result 
you did. What observation can you make about 
addition and subtraction of floating-point 
numbers? 


1. Convert to an IEEE 754 single precision floating 
point number. Show the sign bit, the exponent, 
and the fraction. 

2. Convert to an IEEE 754 double precision 
floating point number. Show the sign bit, the 
exponent, and the fraction. 


3. What are underflow and overflow? 
4. What is the essential idea behind scientific 
notation and floating-point numbers? 


Boolean Algebra 

This chapter reviews Boolean algebra with an 
emphasis on Karnaugh maps for simplifying Boolean 
expressions. Boolean algebra as the calculus of two 
values is fundamental to computer circuits, 
computer programming, and mathematical logic, 
and is also used in other areas of Mathematics such 
as set theory and statistics. Therefore, it is a crucial 
subject in Computer Science. In Part 2, we focus on 
the principle of simplifying Boolean expressions. 
This process has a direct impact on low-cost, 
low-redundancy hardware design. 


Boolean Algebra 
Boolean expressions are defined as follows: 
Definition 1: A Boolean expression can be one of the following: 


* A Boolean variable, 

* A relational expression, or 

* Any combination of Boolean variables, 
relational expressions, and Boolean expressions 
with logical operators AND, OR, NOT, and 
parentheses. 


A Boolean expression will be evaluated to either 
true or false if all of its variables have been 
substituted with their actual values. For example, a 


Boolean variable x is a Boolean expression, and it 
can be evaluated to true or false subject to its actual 
value. Normally, for two Boolean variables x and y, 
xy denotes x AND y, x+y denotes x OR y, and ~x 
represents the negation of x. They are all well-defined 
Boolean expressions because x and y are Boolean 
variables. Parenthesized items are evaluated 
first. 


Operator Precedence 


The operators used in Boolean expressions are 
relational operators (<, <=, ==, >=, >, and 
!=), and logical AND, OR, NOT. If there is more 
than one way to evaluate an expression, it may be 
ambiguous or in some cases invalid. For example, 
x + yz can be evaluated as x + (yz) or (x+y)z, i.e., 
evaluate y AND z first, or x OR y first. Another 
example like x + m<n, where m and n are numbers, may 
cause a syntax error if x + m were evaluated 
first. In this case, logical OR would be applied to the 
number m, but by the above definition, logical OR can 
only be applied to Boolean expressions, and m is a 
number, not a Boolean expression. Operator 
precedence dictates a unique order in evaluating a 
Boolean expression: the higher precedence operator 
first, and from left to right if operators 
have the same precedence. The following table lists the 
operator precedence order for the above operators: 
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Based on the operator precedence order, x + yz will 
be only evaluated to x + (yz). 
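
The same behaviour can be observed in a programming language. The sketch below uses Python, whose own precedence rules place comparisons above `and`, and `and` above `or`; this matches the two examples in the text but is a property of Python, not a restatement of the precedence table. The variable values are arbitrary.

```python
x, y, z = True, False, False
m, n = 3, 5

print(x or y and z)      # True: evaluated as x or (y and z)
print((x or y) and z)    # False: forcing OR first changes the result
print(x or m < n)        # True: evaluated as x or (m < n)
```

Parentheses remain the safest way to make the intended order explicit.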


Laws of Boolean Algebra 


Boolean expressions can be rewritten to other forms 
which do not change their semantics. For example, a 
double negation of a Boolean variable is itself: NOT 
NOT x = x. The “=” sign indicates equivalence of the 
left hand side Boolean expression and the right hand 
side Boolean expression. The equivalent relation can be 
proved by the following truth table: 


x   NOT x   NOT NOT x 
0   1       0 
1   0       1 


The truth table clearly shows that the columns for x 
and NOT NOT x are identical. 


Distribution Law 


An operand of AND can be distributed among OR 
operators. For example, x(y+z) = xy + xz. The 
operand x is distributed inside the parenthesized term y 
+z. The following truth table shows this equivalence. 
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In Algebra, the distribution law applies as well. For 
example, 2*(3+ 6) = (2*3) + (2*6) = 18. Now, 
let’s consider another Boolean expression from the one 
we have seen by exchanging the operators, i.e., x + (yz). 


Will x + (yz) = (x+y)(x+z)? Again, let’s find out 
from the truth table. 

x  y  z   yz   x+(yz)   x+y   x+z   (x+y)(x+z) 
0  0  0   0    0        0     0     0 
0  0  1   0    0        0     1     0 
0  1  0   0    0        1     0     0 
0  1  1   1    1        1     1     1 
1  0  0   0    1        1     1     1 
1  0  1   0    1        1     1     1 
1  1  0   0    1        1     1     1 
1  1  1   1    1        1     1     1 
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Yes. The distribution law applies for this expression in 
Boolean Algebra. However, in ordinary algebra, 
2 + (3*6) = (2+3)*(2+6) is false, since 20 is not 
equal to 40. The main reason is that 
Boolean Algebra does not consider magnitude but 
ordinary algebra does. In Boolean Algebra, the 
distribution law applies both ways: AND over OR, and 
OR over AND. 


Inverse Law 


A Boolean variable or its negation is always true, x OR 
not x = true. A Boolean variable and its negation is 
always false, x AND not x = false. 


Commutative Law 


The operand order for a Boolean operator does not 
matter. For example, xy = yx, and x+y = y+x. 


Associative Law 
The evaluation order for an operator does not matter. 


For example, (x+y)+z = x+(y+z), and (xy)z = 
x(yz). 


Identity Law 


A Boolean expression operated by itself will be itself. 
For example, x+x = x, and xx = x. 


Redundance Law 


Sometimes, some terms in a Boolean expression can be 
eliminated without changing its semantics. For example, 
x + xy = x, and x(x+y)=x. 


False Law 


False and anything is false. False or anything is 
anything. For example, 0x = 0, and 0+x = x. 


True Law 


True or anything is true. True and anything is anything. 
For example, 1x=x, and 1+ x = 1. 


De Morgan’s Law 


De Morgan’s law states that a NOT operator outside a 
Boolean expression can be moved inside by negating the 
variables and changing each AND operator to OR and vice 
versa. For example, !(x+y) = !x!y, and !(xy) = !x + 
!y. 
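
Since a Boolean expression has only finitely many input combinations, each law can be checked exhaustively, which is what a truth table does. The sketch below is illustrative; the helper name equivalent and the dictionary of laws are hypothetical, and only a few of the laws are listed.

```python
from itertools import product

def equivalent(f, g, nvars):
    """True if f and g agree on every combination of nvars Boolean inputs."""
    return all(f(*v) == g(*v) for v in product([False, True], repeat=nvars))

laws = {
    "distribution, AND over OR": (lambda x, y, z: x and (y or z),
                                  lambda x, y, z: (x and y) or (x and z), 3),
    "distribution, OR over AND": (lambda x, y, z: x or (y and z),
                                  lambda x, y, z: (x or y) and (x or z), 3),
    "redundance":                (lambda x, y: x or (x and y),
                                  lambda x, y: x, 2),
    "De Morgan, NOT of OR":      (lambda x, y: not (x or y),
                                  lambda x, y: (not x) and (not y), 2),
    "De Morgan, NOT of AND":     (lambda x, y: not (x and y),
                                  lambda x, y: (not x) or (not y), 2),
}

for name, (lhs, rhs, n) in laws.items():
    print(name, equivalent(lhs, rhs, n))   # every line ends with True
```

A pair of expressions that is not equivalent, such as x OR y and x AND y, makes the helper return False.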


Simplification of Boolean Expressions 


Given a Boolean expression, we can simplify it using 
the laws described in the previous section. However, 
the simplification process may not be so obvious 
sometimes. A graphical representation of a Boolean 


expression in the Karnaugh map greatly helps the 
simplification process. 


Karnaugh Maps 


Karnaugh maps are good at simplifying Boolean 
expressions with up to 4 variables. Although 
simplifying Boolean expressions with more than 4 
variables is normally done by computers with 
sophisticated algorithms, those with a 
small number of variables occur frequently in 
practical designs. Thus, Karnaugh maps are worth 
learning. Additionally, Karnaugh maps provide a 
great deal of insight into digital logic circuits. 


In this section, some Karnaugh maps for two to four 
variables will be presented, and how they can really 
be used to simplify Boolean functions. 


Two-Variable Karnaugh Maps 


Suppose a Boolean expression has two variables, A 
and B. Each has two possible values 
(true or false), so the total number of possible 
combinations is four. These combinations are 
represented in a map similar to a truth table as 
follows. 


       B’   B 
  A’   0    1 
  A    2    3 


The number in each cell is the value of the 
binary representation of AB, which is called a 
minterm. Each cell corresponds to a minterm, 
i.e., a product of the two variables. For example, 
AB’ -> 10 -> 2. Assume the Boolean expression AB’ + AB 
is to be simplified, where AB’ -> 2 and AB -> 3. So we 
fill 1’s into cells 2 and 3, which 
results in the following: 


       B’   B 
  A’ 
  A    1    1 


The 1’s in the second row can be grouped together 
and A can be used to represent them. That is, the 
Boolean expression AB’ + AB can be simplified to 
A(B’+B) = A. If A’B is added to the above Boolean 
expression, i.e., AB’ + AB + A’B, the resultant 
Karnaugh map is as follows: 


       B’   B 
  A’        1 
  A    1    1 


We now have 1’s in the second row and the 
second column. The second row corresponds to 
A. The second column corresponds to B. 
Therefore, AB’ + AB + A’B simplifies to A + B. 
To show this is true, we only need to show 
A + A’B = A + B, which holds because 
A + A’B = (A+A’)(A+B) = 1(A+B) = A+B. 

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Three-Variable Karnaugh Maps 


Likewise, the map for three-variable Boolean 
expressions is reorganized from their truth table. 
Assume three variables A, B, and C in a Boolean 
expression. There are 8 combinations of possible 
values in total for three variables. They are organized 
in the following map. 


       B’C’   B’C   BC   BC’ 
  A’   0      1     3    2 
  A    4      5     7    6 
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The arrangement is based on the rule that only one 
variable changes between any two cells that are 
contiguous vertically or horizontally. Therefore, the 
numbers go 0, 1, 3, 2, not 0, 1, 2, 3, because two 
variables change from 1 (B’C) to 2 (BC’). With this 
special arrangement, contiguous cells can be combined 
by eliminating the changed variable. For example, 
AB’C + ABC is represented in the following 
Karnaugh map: 


The 1’s in the shaded area can be combined to C as 
B’C + BC = (B’+B)C = C. Also, the second row 
indicates A is true. Therefore, AB’C + ABC = AC. The 
process can be interpreted as set operations if the 
cells are divided according to Boolean variables. 


The graphics shown in Figure 1 will be used to 
simplify Boolean expressions. Variable A is 
associated with the second row; variable C is 
associated with the middle two columns; variable B 
is associated with the last two columns. The red 
oval is the intersection of the second row and the 
middle two columns. Thus, we can use AC to 
represent the two cells. Here AC reads as A 
intersects C from the area perspective. The other 
areas, e.g., the first row, are represented as the 
negation of some variable. The first row is the negation 
of A, i.e., A’. For example, A’B’C’ + A’BC’ is 
depicted as follows: 


The first row in Figure 2 is A’ and the two cells are 
outside C, i.e., C’. Thus, we can use A’C’ to 
represent the two cells. In case there are more 1’s in 
the Karnaugh map, continue this way until all 1’s 
are covered. The simplified form is called a sum of 
products. This graphical technique greatly 
simplifies Boolean expressions, but the order in 
which the 1’s are filled is important. 
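
The column order 0, 1, 3, 2 is an instance of a reflected Gray code, in which consecutive values differ in exactly one bit. A minimal sketch, assuming the common formula i XOR (i shifted right by one); the function name gray_order is hypothetical:

```python
def gray_order(bits):
    """Cell numbers in an order where neighbours differ in exactly one bit."""
    return [i ^ (i >> 1) for i in range(1 << bits)]

columns = gray_order(2)
print(columns)                               # [0, 1, 3, 2]
for a in (0, 1):                             # rows A' and A of the three-variable map
    print([a * 4 + bc for bc in columns])    # [0, 1, 3, 2] then [4, 5, 7, 6]
```

The first and last entries also differ in one bit, which is why cells on opposite edges of a Karnaugh map count as adjacent.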



Four-Variable Karnaugh Maps 


Let’s now consider the four-variable case. There are 
16 possible combinations for the variables A, B, C, 
and D. What we want is to arrange the variables on a 
2-dimensional table in such a way that any two cells 
adjacent vertically or horizontally differ in only one 
variable. Therefore, we come up with the 
following arrangement. 


         C’D’   C’D   CD   CD’ 
  A’B’   0      1     3    2 
  A’B    4      5     7    6 
  AB     12     13    15   14 
  AB’    8      9     11   10 

Again, the numbers in the table cells 
represent the binary value of ABCD. Note that the 
order of the numbers goes from outward to 
inward. These numbers will help you fill in 1’s for a 
Boolean expression to be simplified. The graphical 
representation of the four-variable Karnaugh map is 
illustrated as follows. 


Figure Graphical Representation of the Four- 
Variable Karnaugh Map 


For example, AB’C’D + AB’CD will be tabulated in 
the following Karnaugh map. 


The numbers in the corresponding parentheses 
indicate the binary value of the corresponding 
terms. In practice, converting the terms to their 
binary values and filling 1’s into the corresponding 
table cells may avoid unnecessary mistakes. Filling 
in the 1’s directly by looking at row and column 
headings is error-prone. From the Karnaugh map 
above, the two cells are inside A and inside D but 
outside B. Therefore, they can be represented as 
AB’D. Note that C is not involved because one cell (9) 
is outside C and the other cell (11) is inside C. Thus 
the Boolean expression AB’C’D + AB’CD = AB’D. The 
following proof shows the simplification is correct. 


AB’C’D + AB’CD = AB’DC’ + AB’DC ; apply 
commutative law on both terms 


= AB’D(C’+C) ; apply distribution law 
= AB’D1 ; apply inverse law 


= AB’D ; apply true law 
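
The advice to convert terms to their binary values first, and the result of the proof, can both be checked with a short sketch. The helper names minterm_number and cells are hypothetical, and terms are written with a straight apostrophe for negation.

```python
def minterm_number(term, variables="ABCD"):
    """Binary value of a full product term such as "AB'C'D" (' marks a negated variable)."""
    value = 0
    for v in variables:
        i = term.index(v)
        value = value * 2 + (0 if term[i + 1:i + 2] == "'" else 1)
    return value

def cells(predicate, nvars=4):
    """Cell numbers whose bits (most significant first) satisfy the predicate."""
    result = set()
    for n in range(1 << nvars):
        bits = [(n >> (nvars - 1 - k)) & 1 for k in range(nvars)]
        if predicate(*bits):
            result.add(n)
    return result

original = {minterm_number("AB'C'D"), minterm_number("AB'CD")}
simplified = cells(lambda a, b, c, d: a and not b and d)   # AB'D
print(sorted(original), sorted(simplified))   # [9, 11] [9, 11]
```

Two expressions are equivalent exactly when they cover the same set of cells.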
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Five-Variable Karnaugh Maps 


In practice, simplification of a five-variable (or 
more) Boolean expression is done by computer 
programs such as VHDL compilers. However, some 
compilers may not produce an optimal result. In that 
case, the five-variable Karnaugh map may help. Normally, 
we split the variables A, B, C, D, E into 2 groups: AB 
and CDE. We now arrange the 32 possible combinations 
in the following table. 
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The corresponding graphical representation is as 
follows. 


Given a Boolean expression ABC’DE + ABC’DE’ 
+ ABCDE’ + ABCDE, let’s simplify it using the 
Karnaugh map. First, translate the terms to their 
corresponding binary values: ABC’DE = 27, ABC’DE’ = 26, 
ABCDE’ = 30, ABCDE = 31. Then, fill in 1’s at 
the cells numbered 27, 26, 30, and 31 as depicted in 
the following diagram. 


Figure The Karnaugh Map for the Five-Variable Boolean 
Expression ABC’DE + ABC’DE’ + ABCDE’ + ABCDE 
(1’s in cells 26, 27, 30, and 31) 


It is clear that the red oval area is the intersection of 
A, B and D. Therefore, the Boolean expression 
ABC’DE + ABC’DE’ + ABCDE’ + ABCDE = ABD. 
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Special Considerations in Using Karnaugh 
Maps 


As can be seen, the more contiguous 1’s on the 
map, the higher the chance the Boolean expression can 
be simplified. However, there may also be a chance to 
combine scattered cells using negation. For example, 
the Boolean expression mapped to the 4 
corners of the five-variable map, A’B’C’D’E’ (0) 
+ A’B’CD’E’ (4) + AB’C’D’E’ (16) + AB’CD’E’ (20), 
can be simplified to B’D’E’. 


In some cases, we may need to include a cell in 
several terms of the simplified from. This is fine 
because of the sum-of-product form. For example, in 
the Boolean expression ABCD + ABCD’+ A’*BCD’, 
ABCD‘ is used twice in both in simplifying ABCD 

+ ABCD’ and ABCD’+ A’BCD’ as shown in . The red 
oval will lead to ABC whereas the green oval will 
result in BCD’. Therefore, the Boolean expression 
can be simplified as ABCD + ABCD’ + ABCD’ = ABC 
+ BCD’. 
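
Both simplifications in this section can be checked by comparing truth tables row by row. The following is a minimal sketch; the helper name equivalent and the function names are made up for this illustration.

```python
from itertools import product

def equivalent(f, g, n):
    """Compare two Boolean functions of n inputs on all 2**n input rows."""
    return all(bool(f(*row)) == bool(g(*row))
               for row in product([False, True], repeat=n))

# Four corners of the five-variable map (cells 0, 4, 16, 20) versus B'D'E'.
def corners(a, b, c, d, e):
    return ((not a and not b and not c and not d and not e) or
            (not a and not b and c and not d and not e) or
            (a and not b and not c and not d and not e) or
            (a and not b and c and not d and not e))

def corners_simplified(a, b, c, d, e):
    return not b and not d and not e

# ABCD + ABCD' + A'BCD' versus ABC + BCD' (the cell ABCD' is shared).
def shared(a, b, c, d):
    return ((a and b and c and d) or (a and b and c and not d) or
            (not a and b and c and not d))

def shared_simplified(a, b, c, d):
    return (a and b and c) or (b and c and not d)

print(equivalent(corners, corners_simplified, 5))  # True
print(equivalent(shared, shared_simplified, 4))    # True
```

An exhaustive comparison like this is practical only for a small number of variables, since the number of rows doubles with each added variable.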


A term in a Boolean expression may contain fewer
variables. The missing variables will not affect the
value of the term. Those are “don’t care” to the
term. In fact, the term A can be expanded to
A(B + B’), i.e., AB + AB’. Since B + B’ is always true,
ANDing it with the term A will not change its value.
Should there be more variables, we can add them in
the same way, each together with its inverse form.
It turns out that a cell is filled with a one as long as
it belongs to the A area in the corresponding
Karnaugh map. For example, in a four-variable case,
the Boolean expression A + BC + C can be mapped to
the Karnaugh map shown in the following figure. It
can be simplified to A + C.


Figure A Boolean Expression A + BC + C with Fewer
Variables in the Karnaugh Map (1’s in cells 2, 3, 6, 7,
and 8 through 15)
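
The expansion of a term with missing variables can be sketched as follows. The representation of a term as a dictionary, the name cells_of, and the bit order (A as the most significant bit) are assumptions of this sketch.

```python
from itertools import product

VARIABLES = "ABCD"  # assumed order: A is the most significant bit

def cells_of(term: dict) -> set:
    """Expand a product term, given as {variable: required bit}, to its cells.

    Variables missing from the term are "don't care" and take both values.
    """
    cells = set()
    for bits in product([0, 1], repeat=len(VARIABLES)):
        row = dict(zip(VARIABLES, bits))
        if all(row[v] == want for v, want in term.items()):
            cells.add(int("".join(map(str, bits)), 2))
    return cells

# A + BC + C, written as three product terms.
expression = [{"A": 1}, {"B": 1, "C": 1}, {"C": 1}]
ones = set().union(*(cells_of(t) for t in expression))
print(sorted(ones))  # [2, 3, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]

# The same cells are covered by A + C alone.
print(ones == cells_of({"A": 1}) | cells_of({"C": 1}))  # True
```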


Exercise 


1. Prove x + x’y = x + y using direct applications of
   Boolean Algebra laws and truth tables.

2. Prove x(x’ + y) = xy using direct applications of
   Boolean Algebra laws and truth tables.

3. Derive a Boolean expression for a voting
   machine which allows three persons to vote.
   The output is either yes or no depending on the
   number of persons voting yes. Two restrictions:
   1) No abstention is allowed. That is, each person
   has to vote either yes or no. 2) The outcome is yes
   if there are two or more yes votes; otherwise, the
   outcome is no. Normally, when forming a
   voting group, like a board of trustees, the number
   of voting members is set to an odd
   number to avoid a tie. Derive a sum-of-product
   form first and then use Karnaugh
   maps to simplify the expression.

4. Simplify the following Boolean expressions
   using Karnaugh maps:
   1. ABCD’+ABCD+ABC’D’+ ABCD
   2. AB’C’D’+A’B’CD’+ AB’C’D’+ ABCD’
   3. AB’CDE’+A BCD E+ABCDE+AB’CDE’ +A’BCD E’+ ABCD E+A BCD E + A’BCD’E’
   4. A+ ABC’D’+ AB’CD’

Simplification of Boolean Expressions 
[footnote] Note that infinity is not a number and 0
times infinity is undefined.


2.1 Boolean Algebra Simplification 


The most practical use of Boolean algebra is to
simplify logic circuits. A Boolean expression can be
implemented directly in a logic circuit. The number
of terms and operations in a Boolean expression is
directly related to the number of logic components.
Through Boolean algebra simplification, a Boolean
expression is translated to another form with fewer
terms and operations. A logic circuit for
the simplified Boolean expression performs the
identical function with fewer logic components
than its original form. Additionally, with fewer
components, the circuit built from the simplified
Boolean expression is more reliable and costs less.


Boolean expressions may be simplified by applying a
series of Boolean algebra laws, described in the
previous chapter. The simplification process is
similar to that of regular algebra. For example, the
algebraic expression

(12345/98765 + 67890*23456)*0 is simplified to 0
based on the algebra rule that 0 times any
number is 0 [footnote]. In Boolean algebra, there is
a corresponding rule called the false law, which
states that false AND anything is false, i.e., 0x = 0,
where x is any Boolean expression. For example, the
Boolean expression 0(A + BC + ...) = 0, where A, B, C,
D, E, and F are Boolean variables and 0 indicates
Boolean false. There are 4 terms in the previous
Boolean expression. After the false law is applied,
the expression becomes one term, i.e., false. On the
other hand, the true law states that true OR
anything is true, i.e., 1 + A + B + C = 1, where A, B,
and C are any Boolean expressions.


There may be different ways to simplify a
Boolean expression by applying Boolean algebra
laws. The following example shows a simplification
process for a Boolean expression that evaluates to
true.


B + (ABC)’
= B + A’ + (BC)’ ; De Morgan’s law
= B + A’ + B’ + C’ ; De Morgan’s law
= B + B’ + A’ + C’ ; Commutative law
= (B + B’) + A’ + C’ ; Associative law
= 1 + A’ + C’ ; Inverse law
= 1 ; True law


The above simplification process may not be unique.
The decomposition of (ABC)’ may be considered as
either (A·BC)’ or (AB·C)’, where · indicates an AND
operation. Obviously, either case will lead to
A’ + B’ + C’, but the process is different.


AB + A’B + AB’ + A’B’
= (A + A’)B + AB’ + A’B’ ; Distribution law
= 1B + AB’ + A’B’ ; Inverse law
= B + AB’ + A’B’ ; True law
= B + (A + A’)B’ ; Distribution law
= B + 1B’ ; Inverse law
= B + B’ ; True law
= 1 ; Inverse law
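
A derivation like the first one above can be sanity-checked by confirming that every intermediate step has the same truth table. The sketch below encodes the steps by hand as Python lambdas; it illustrates the check and does not replace the algebraic proof.

```python
from itertools import product

# Each step of the first derivation as a function of (a, b, c).
steps = [
    lambda a, b, c: b or not (a and b and c),        # B + (ABC)'
    lambda a, b, c: b or not a or not (b and c),     # De Morgan's law
    lambda a, b, c: b or not a or not b or not c,    # De Morgan's law
    lambda a, b, c: (b or not b) or not a or not c,  # commutative, associative
    lambda a, b, c: True,                            # inverse law, true law
]

rows = list(product([False, True], repeat=3))
tables = [[bool(step(*row)) for row in rows] for step in steps]

# Every step must have exactly the same truth table as the first one.
print(all(table == tables[0] for table in tables))  # True
```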


2.1.1 Exercise 


Simplify the following Boolean expressions using
Boolean algebra laws.


1. A+AB’=1 

2. AB‘(A+B’)(B’+B)=A’ 

3. (A+C)(AD+AD‘)+AC+C=A+C 
4. A+AB=A 

5. A(A+B)+(B+AA)(A+B’)=A+B 
6. BC+ BC’+BA=B 


7. A + A’B + A’B’C + A’B’C’D + A’B’C’D’E = A + B + C + D + E


8. A(A+B)=A


Combinational Logic Design 

In this chapter, we will introduce the basics of
combinational logic design. One good example of
the design is the Arithmetic Logic Unit (ALU) in each
core of a CPU. Students will understand the
principles and methodology of digital logic design at
the gate and switch level; gain experience
developing a relatively large and complex digital
system using breadboard prototyping; gain
experience with modern computer-aided design
tools for digital logic design; appreciate methods for
specifying digital logic, as well as the process by
which a high-level specification of a circuit is
synthesized into logic networks; appreciate the
tradeoffs between hardware and software
implementations of a given function; and appreciate
the uses and capabilities of a modern FPGA platform.


Combinational Logic Design


Combinational logic is a hardware component that is
composed of basic logic gates such as AND, OR,
and NOT without any clock involved. Its output
depends only on the current inputs. Whenever the
inputs are provided, the output is ready after a
propagation delay. This delay is the time the signal
takes to travel from the inputs to the outputs. There
may be several paths from inputs to outputs; the
longest one is the critical path, and its time is
the propagation delay. Normally, the shorter the
delay, the better the performance. However,
in order to achieve a shorter delay, more resources
may be needed. There is always a trade-off between
speed and space. Combinational logic is no exception.
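
To make the idea of the critical path concrete, the following sketch computes the propagation delay of a small circuit as the longest path from any input to the output. The gate delays are hypothetical values chosen for illustration, and the circuit (which computes A’B + AB’) and the name arrival_time are assumptions of this sketch.

```python
from functools import lru_cache

# Hypothetical gate delays in nanoseconds (illustrative values only).
DELAY = {"NOT": 1.0, "AND": 2.0, "OR": 2.0}

# A small circuit computing A'B + AB'. Each gate maps to (type, inputs).
# Signals that are not keys of CIRCUIT ("A", "B") are primary inputs.
CIRCUIT = {
    "nA": ("NOT", ["A"]),
    "nB": ("NOT", ["B"]),
    "t1": ("AND", ["nA", "B"]),
    "t2": ("AND", ["A", "nB"]),
    "S":  ("OR",  ["t1", "t2"]),
}

@lru_cache(maxsize=None)
def arrival_time(signal: str) -> float:
    """Latest time at which `signal` becomes valid (inputs arrive at t = 0)."""
    if signal not in CIRCUIT:
        return 0.0
    gate, inputs = CIRCUIT[signal]
    return DELAY[gate] + max(arrival_time(i) for i in inputs)

print(arrival_time("S"))  # 5.0
```

Here the critical path runs through a NOT gate, an AND gate, and the OR gate, so the delay is 1 + 2 + 2 under the assumed values.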


Sum of Product 


When designing combinational logic, we are given
the relation between inputs and outputs. If the
input/output relation is not given explicitly, it must
be derived from the problem statement. The relation
is then tabulated in a truth table, which can be used
to derive a Boolean expression in the “sum of
product” form. For example, a half adder will add
two bits (A, B) and output their sum (S).


Figure 1 The Truth Table of the Half Adder


The left-hand side of the vertical line in figure 1 lists
the input signals (A and B) whereas the right-hand
side lists the output (S). From the table, based on
the second and the third rows, S
is true in two cases: A is false and B is true, or A is
true and B is false. As a Boolean expression, the above
statement is equivalent to A’B + AB’. Therefore, we
can derive S = A’B + AB’.
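
Reading the sum-of-product form off a truth table is mechanical, as the following sketch suggests. The function name sum_of_products and the ASCII apostrophe for the complement are choices made for this illustration.

```python
from itertools import product

def sum_of_products(names, output):
    """Build a sum-of-product string from the rows where output(...) is 1."""
    terms = []
    for row in product([0, 1], repeat=len(names)):
        if output(*row):
            # A 0 input appears complemented, a 1 input appears as is.
            terms.append("".join(n if bit else n + "'"
                                 for n, bit in zip(names, row)))
    return " + ".join(terms)

def half_adder_sum(a, b):
    return (a + b) % 2   # sum bit of A plus B

print(sum_of_products("AB", half_adder_sum))  # A'B + AB'
```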


Once the Boolean expression is obtained, building 
the circuit involves the following steps: 


1. Attach a NOT gate to each input variable that
   needs negation

2. Draw each product term using an AND gate
   with the corresponding inputs or negated inputs

3. Add an OR gate that combines all the product
   terms to produce the output


The resultant schematic diagram for the half adder
is illustrated in figure 2.


Figure 2 The Half Adder Combinational Circuit
Using Sum of Product Approach


In general, combinational logic with any number of
variables can be designed using the sum of product
approach. However, the derived Boolean expression
may not necessarily be the simplified one, i.e., the
number of gates may not be minimized. For
example, the Boolean expression ABC’DE + ABC’DE’
+ ABCDE’ + ABCDE can be simplified to ABD. That
will reduce the number of gates to just 2.


Product of Sum 


If we consider the false cases in the truth table, a
product-of-sum form can be derived. The idea is that
the output must not be any of those false cases. This
statement is translated to a product of negated
terms. De Morgan’s law is then applied to the
negated terms, which results in “sum” terms. In the
half adder example, the corresponding
product-of-sum form is derived as follows:


S = (A’B’)’(AB)’ = (A + B)(A’ + B’)


The circuit is illustrated in figure 3.
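
The same derivation can be sketched in code by collecting the false rows and flipping each literal, which is what De Morgan’s law does to a negated product term. The function name product_of_sums is made up for this sketch, and the final check compares both forms against the truth table.

```python
from itertools import product

def product_of_sums(names, output):
    """Build a product-of-sum string from the rows where output(...) is 0."""
    factors = []
    for row in product([0, 1], repeat=len(names)):
        if not output(*row):
            # De Morgan: NOT(row) becomes an OR of the flipped literals.
            literals = [n + "'" if bit else n for n, bit in zip(names, row)]
            factors.append("(" + " + ".join(literals) + ")")
    return "".join(factors)

def xor(a, b):
    return a ^ b

print(product_of_sums("AB", xor))  # (A + B)(A' + B')

# Both forms agree with the truth table on every row.
def sop(a, b):
    return ((1 - a) & b) | (a & (1 - b))

def pos(a, b):
    return (a | b) & ((1 - a) | (1 - b))

print(all(sop(a, b) == pos(a, b) == xor(a, b)
          for a, b in product([0, 1], repeat=2)))  # True
```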


Figure 3 The Half Adder Combinational Circuit
Using Product of Sum Approach


When mapped to combinational logic, the product of
sum form in the half adder example uses exactly the
same number of gates as the sum of product
approach. However, the number of gates of each
type used is different. Although the two forms are
logically equivalent, in practice one may be preferred
due to space considerations (some gates occupy more
space than others), time considerations (some gates
are faster than others in terms of propagation
delay), or simply running out of a specific type of
gate in some chip.


A Three-Way Switch Example 


Three-way switches are used in almost every
home to control light bulbs in situations where they
can be turned on or off using either of two switches
at different locations, such as stairs that connect
two floors. This involves two single pole, double
throw (SPDT) switches, each of which has four
connectors including ground. This type of switch is
different from a single pole, single throw (SPST)
switch, which only has three connectors including
ground. Figure 4 shows typical wiring for
a three-way switch control.


Figure 4 A Typical Three-Way Switch Wiring


Each of the switches in figure 4 will either connect
to the upper point or the lower point. Assume sw1 is
turned up; turning sw2 up completes the circuit
for the current, and turning sw2 down breaks it.
Similarly, if sw1 is turned down, turning sw2
down completes the circuit, and turning sw2 up
breaks it. The cases for controlling the lamp using
sw1 are symmetric.


The circuit is simple but it requires three-conductor
electrical cable for wiring. For a regular switch,
two-conductor electrical cable is enough. Besides,
both switches have to be the SPDT type. Thus, the
cost is higher. It would be good if regular SPST
switches and regular two-conductor electrical
cable could be used in this type of application. First
of all, you don’t need to get special SPDT
switches and three-conductor cables. Second, the
wiring will be simplified because only one type of
cable is used. After framing is done in home
construction, an electrician will simply lay cables
from any switch box location to the lamp. Third,
when remodeling an old house or adding extra lamp
control, the two-conductor cable can be reused.
Lastly, the overall cost will be reduced because all
the material is generic. To gain these benefits, a
simple digital circuit will be designed.


Design a three-way switch logic to control a light
bulb using two SPST switches according to the
following rule. If the light is on, flipping either of
the two switches will turn off the light. If the light is
off, flipping either of the two switches will turn it
back on. At first glance, it seems that the two
switches turn the light bulb on or off according to
their previous positions. In fact, we can choose a
state to start with, say, all switches are off and the
lamp is off. A truth table is then derived from there.
The following table shows the truth table of the
three-way switch control. When both switches are
off (case 0), turning on either switch will turn on the
lamp (cases 1 and 2). While in case 1, turning on
sw1 will turn off the light (case 3). Similarly, while in
case 2, turning on sw2 will turn off the lamp
(case 3). It is worth mentioning that the lamp
changes its state (from on to off, or from off to on)
only when exactly one of the switches changes its
position. For example, from case 0 to case 1, only
sw2 changes its position, and sw1 keeps its position.
On the other hand, should both switches change
their positions, the lamp will not change its state.
For example, the lamp remains off from case 0 to
case 3, and the lamp stays on from case 1 to case 2.


Table The Truth Table for the Three-Way Switch
Control


Case  sw1  sw2  Lamp
0     Off  Off  Off
1     Off  On   On
2     On   Off  On
3     On   On   Off


A closer look at the truth table shows that it is
identical to an exclusive OR function. Therefore, we
can derive the following Boolean functions:


Lamp = sw1 ⊕ sw2, or Lamp = sw1’sw2 + sw1sw2’
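
The required behavior, that flipping either switch always toggles the lamp, can be checked with a short simulation. The sequence of flips below is arbitrary and chosen only for illustration.

```python
def lamp(sw1: bool, sw2: bool) -> bool:
    """Lamp = sw1 XOR sw2: on when exactly one switch is on."""
    return sw1 != sw2

# Start from the chosen initial state: both switches off, lamp off.
sw1, sw2 = False, False
history = [lamp(sw1, sw2)]

# Flip one switch at a time; which one is flipped does not matter.
for flipped in ["sw1", "sw2", "sw2", "sw1", "sw1"]:
    if flipped == "sw1":
        sw1 = not sw1
    else:
        sw2 = not sw2
    history.append(lamp(sw1, sw2))

print(history)  # [False, True, False, True, False, True]
```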


The digital circuit is depicted in figure 5.


Figure 5 The Three-Way Switch Control Logic


Logic Gate Equivalence 


In some chips like the 7400, where only NAND
(not-and) gates are provided, it is necessary to use
the NAND gates to build other logic elements.
Fortunately, this is feasible. First, let’s examine its
truth table as shown in Figure 6.


Figure 6 The Truth Table of The NAND Gate


By carefully examining the first row and the last
row of the truth table, a NOT gate can be derived by
wiring the two inputs together to force the two
inputs of the NAND gate to have the same value,
either 0 or 1. When the two inputs are 0’s, the
output is 1, and vice versa. This exactly simulates a
NOT gate. Figure 7 illustrates the NOT gate created
from the NAND gate.


Figure 7 A NAND Gate That Simulates a NOT Gate


Once we can build a NOT gate from the NAND gate,
the AND gate is straightforward as it will be a NOT
gate right after a NAND gate. The diagram is shown
in Figure 8.


Figure 8 An AND Gate Built from Two NAND Gates


To build an OR gate by using NAND gates, we can
apply De Morgan’s law as follows:


A + B = ((A + B)’)’ = (A’B’)’, which requires three
NAND gates as shown in Figure 9.


Figure 9 An OR Gate Built from Three NAND Gates


Therefore, the basic logic gates (AND, OR, NOT)
can be built from NAND gates. Consequently, other
logic blocks can be constructed from NAND gates
only.
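
The three constructions can be mirrored in code, with every gate expressed through a single nand function. The function names below are chosen for this sketch; the loop prints the truth tables so that they can be compared with those of the ordinary gates.

```python
from itertools import product

def nand(a: int, b: int) -> int:
    return 1 - (a & b)

def not_(a):            # one NAND with both inputs tied together
    return nand(a, a)

def and_(a, b):         # a NAND followed by a NAND-based NOT
    return not_(nand(a, b))

def or_(a, b):          # (A'B')': two NAND-based NOTs feeding a NAND
    return nand(not_(a), not_(b))

print("A B NOT(A) AND OR")
for a, b in product([0, 1], repeat=2):
    print(a, b, not_(a), and_(a, b), or_(a, b))
```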


Exercise 


1. Using only the component NOR (not-or), create 
a circuit that is equivalent to the logic of an 
AND gate, a NOT gate, and an OR gate. 

2. Build three-input AND, and OR gates from two- 
input AND and OR gates plus NOT gates. 

3. Create a two-input adder with two outputs: 
sum and carry. Carry outputs one when both 
inputs are 1’s. 

4. A robot will be designed to move toward a
   light source. There are three photo sensors SL,
   SC, and SR mounted on the robot. SL is located
   at the left of the robot and is in charge of sensing
   left light; SC at the front senses the front
   light; SR at the right senses the right light.
   The two wheels of the robot are powered (PL
   and PR) individually according to the sensors.
   Assume the sensor outputs and wheel power
   are either on or off, i.e., binary. If only SL
   detects light, the robot should turn left, i.e., the
   right wheel must be powered up (PR).
   Similarly, if only SR senses light, the left wheel
   must be powered up (PL). If only the
   forward-pointing sensor SC detects light, both PL
   and PR should be on to drive the robot forward.
   Design a simple logic used to actuate the wheels
   under each sensor condition. Modify such logic by
   using NAND gates only, and minimize the
   number of gates.

5. Design a four-way switch logic to control a
   light bulb using three on-off switches according
   to the following rule. If the light is on, flipping
   any of the three switches will turn off the light.
   If the light is off, flipping any of the three
   switches will turn it back on. Note that the
   circuit is normally found in controlling a light
   bulb for stairs in-between floors. For example,
   you turn on the light using a switch on the first
   floor and you go upstairs where you can turn it
   off using the other switches. Should another
   person enter the stairs, the light will
   be turned on using the switch on the first floor
   again. The three switches can be used to turn
   on or off the light bulb without regard to their
   previous positions.

6. Produce a five-way switch logic to control a
   light bulb using four on-off switches according
   to the following rule. If the light is on, flipping
   any of the four switches will turn off the light.
   If the light is off, flipping any of the four
   switches will turn it back on.

7. An equivalent circuit for an AND gate is shown
   below:


State Machines 

This chapter describes states and discusses the
concept of finite state machines. After this chapter,
students should be able to: describe a computer as a
state machine that interprets machine instructions;
describe computations as a system characterized by
a known set of configurations with transitions from
one unique configuration (state) to another (state);
describe the distinction between systems whose
output is only a function of their input
(combinational) and those with memory/history
(sequential); explain how a program or network
protocol can also be expressed as a state machine,
and that alternative representations for the same
computation can exist; develop state machine
descriptions for simple problem statement solutions
(e.g., traffic light sequencing, pattern recognizers);
derive time-series behavior of a state machine from
its state machine representation; design a
deterministic finite state machine to accept a
specified language; and explain clocks, latches,
sequencing, registers, and memory.


State Machines 


Introduction 


State machines are used extensively to model a
system by its states and transitions. In computer
science, several core fields rely on state machines to
design and model system behaviors. These fields
include software engineering, computer
architecture, operating systems, compilers, and the
like. In computer architecture, state machines are
frequently used to design sequential logic. Because
state machines are used in almost all areas of
computer science, it is important to learn them.
They also play a key role in the theory of
computation.


State Transition System 


A state transition system is an abstract machine for
studying computation. It is composed of a set of
states and transitions between states. The number
of states or the number of transitions in a state
transition system may not be finite or countable. If
the number of states is finite, the state transition
system is called a finite automaton or finite state
automaton.


The transitions may be labeled. The label is used to
indicate expected inputs, conditions that trigger the
transition, or actions performed during the
transition. For example, a string matcher that
recognizes the word “ABC” will have 4 states
including an initial state (s0), and one state for each
letter (s1, s2, s3). There are three transitions:
s0→s1 on reading A, s1→s2 on reading B, and
s2→s3 on reading C. The following figure illustrates
the state machine for the string matcher.


Figure The State Machine for a String Matcher
that Recognizes the Word "ABC"


Formally, the labeled state transition system is a
triple (Q, Σ, δ) where Q is a set of states, Σ is a set of
labels, or alphabet, and δ is a transition function
that maps Q × Σ to Q, i.e., δ: Q × Σ → Q. In the
example shown in the figure above,
Q = {s0, s1, s2, s3}, Σ = {A, B, C}, and δ is defined as
δ(s0, A) = s1, δ(s1, B) = s2, and δ(s2, C) = s3.
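
The transition function of this example maps naturally onto a lookup table. The sketch below stores δ as a Python dictionary; the names DELTA and run, and the choice to return None when no transition is defined, are assumptions made for this illustration.

```python
# Transition function δ stored as a dictionary: (state, symbol) -> next state.
DELTA = {
    ("s0", "A"): "s1",
    ("s1", "B"): "s2",
    ("s2", "C"): "s3",
}

def run(text: str, start: str = "s0"):
    """Follow δ symbol by symbol; return None if no transition is defined."""
    state = start
    for symbol in text:
        state = DELTA.get((state, symbol))
        if state is None:
            return None
    return state

print(run("ABC"))  # s3
print(run("AB"))   # s2
print(run("AC"))   # None
```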


Finite Automata 


In practice, we limit the number of states to be finite
in state machines for them to be realizable. State
machines with finite states are called finite state
machines. The string matcher above is a finite state
machine because it has 4 states. However, how does
the matcher indicate that the string “ABC” has been
recognized? If the system moves from state s0, the
initial state, to state s3, according to the state
transition function and corresponding inputs, it
actually recognizes the string “ABC.” In other words,
the system accepts the input string “ABC” as long as
it is in state s3. The state s3 is called a final or
accepting state. Note that there may be more than
one final state. The term “finite automata” refers to
finite state machines with a set of final or accepting
states. For example, if we want the matcher to
recognize both “ABC” and “AB”, both states s2 and
s3 should be final states. The following figure
illustrates an example of a finite automaton with 2
final states.


Figure The String Matcher with 2 Accepting
States Marked by Double Circles


It is worth mentioning that the state transition
function discussed so far takes the system to one
specific state according to the input. This type of
finite automaton is called a deterministic finite
automaton (DFA). In practice, it is convenient to
model a system with some non-deterministic
features. Given an input that triggers a transition,
the system may move to another state or a set of
states. For example, in the matcher automaton, if we
want to recognize both “AD” and “ABC”, we could
build two state machines, one for “ABC” and the
other for “AD,” and link them together at their
initial states as shown in the following figure, with
final states marked by double circles. In this case, at
state s0, the system could move to state s1 or state
s4 on input A. This creates a nondeterministic finite
automaton (NFA). With this non-determinism, one
can reuse as many states as possible and describe a
system more concisely. In addition, each NFA is
equivalent to a DFA.


Figure An Example of Nondeterministic Finite
Automata


Formally, the deterministic finite automaton A is
defined as a five-tuple


A = (Q, Σ, δ, q0, F) where


1. Q is a finite set of states,

2. Σ is a finite set of labels or symbols,

3. δ is a transition function, δ: Q × Σ → Q,

4. q0 ∈ Q is an initial state, and

5. F ⊆ Q is a set of final or accepting states.
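
The five-tuple can be written down almost literally in code. The sketch below is one possible encoding, not a standard library type: the class name DFA and its field names are invented here, and a missing entry in δ is treated as rejection (the diagrams in this chapter omit such transitions).

```python
from typing import NamedTuple

class DFA(NamedTuple):
    states: frozenset      # Q
    alphabet: frozenset    # Σ
    delta: dict            # δ as {(state, symbol): next_state}
    start: str             # q0
    finals: frozenset      # F

    def accepts(self, text: str) -> bool:
        state = self.start
        for symbol in text:
            # A missing entry means the input is rejected (partial δ).
            if (state, symbol) not in self.delta:
                return False
            state = self.delta[(state, symbol)]
        return state in self.finals

# The string matcher with two accepting states, s2 and s3.
matcher = DFA(
    states=frozenset({"s0", "s1", "s2", "s3"}),
    alphabet=frozenset("ABC"),
    delta={("s0", "A"): "s1", ("s1", "B"): "s2", ("s2", "C"): "s3"},
    start="s0",
    finals=frozenset({"s2", "s3"}),
)

print([w for w in ["A", "AB", "ABC", "ABCA"] if matcher.accepts(w)])
# ['AB', 'ABC']
```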


It can be observed that the definition of the NFA is
the same as that of the DFA except for the transition
function. Instead of mapping to a specific state as in
a DFA, the transition function maps to a set of
states, specifically, a subset of Q. Thus, the
transition function of an NFA is defined as follows:


δ: Q × Σ → 2^Q, where 2^Q denotes the set of all subsets of Q.
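
One way to run an NFA is to keep track of the set of all states it could currently be in. The sketch below does this for the “AD”/“ABC” matcher; the state name s5 (the accepting state after “AD”) and the names NFA_DELTA and nfa_accepts are assumptions, since only s0 through s4 are named in the text.

```python
# NFA for "ABC" and "AD": from s0, input A may lead to s1 or s4.
NFA_DELTA = {
    ("s0", "A"): {"s1", "s4"},
    ("s1", "B"): {"s2"},
    ("s2", "C"): {"s3"},
    ("s4", "D"): {"s5"},   # s5 is a hypothetical name for the "AD" final state
}
FINALS = {"s3", "s5"}

def nfa_accepts(text: str, start: str = "s0") -> bool:
    """Track the set of all states the NFA could be in after each symbol."""
    current = {start}
    for symbol in text:
        current = set().union(*(NFA_DELTA.get((q, symbol), set())
                                for q in current))
    return bool(current & FINALS)

print([w for w in ["AD", "ABC", "AB", "ABD"] if nfa_accepts(w)])
# ['AD', 'ABC']
```

Tracking sets of states in this way is also the idea behind converting an NFA into an equivalent DFA, where each reachable set becomes a single DFA state.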


The SR Latch 


A hardware implementation of finite automata
requires state transition logic and a state variable
that records which state the machine is in at each
moment. The state variable is typically implemented
using a register. A register is a basic sequential logic
component, and it is based on a hardware latch. A
latch is like a bi-stable multivibrator, an electronic
circuit that has two stable states, and thus can be
used to store one bit of information. Normally,
latches are transparent storage devices whereas
flip-flops are non-transparent ones with triggering
clocks.


The SR NOR Latch 


The output of a latch depends not only on its inputs
but also on its current state. The most fundamental
latch is the SR latch, which can be built from static
logic gates. The letter “S” stands for set and the
letter “R” denotes reset. Two NOR (not OR) gates are
cross-coupled to build an SR NOR latch as depicted
in the following figure. The outputs are Q and its
complement Q’. Given the inputs S=1, R=0, there are
two cases: 1) Q=1 (Q’=0) or 2) Q=0 (Q’=1). Let’s use
the aggregated notation SRQQ’ to derive the value
changes step by step. When Q=1, the signals SRQQ’
change from 1010→1010. Actually, no signal
changes in this case because the set signal would
set Q to true but Q is already true. When Q=0,
SRQQ’ changes from 1001→1000→1010, where 1000
is a short transient value. All the signals stabilize at
1010. This result shows that the set operation will
make Q true if it is not.


Figure The SR NOR Latch


Assume that the inputs are S=0 and R=1. We again
have to consider two cases: Q=0 and Q=1. When
Q=0, the aggregated signals SRQQ’ change
from 0101→0101. As a matter of fact, the outputs
do not change at all, because the reset operation on
a state in which Q is already zero changes nothing.
When Q=1, the aggregated signals change from
0110→0100→0101. The signals stabilize at 0101, i.e.,
the output Q has been reset to zero. Note that the
transient value 0100 is not stable and is caused by
the propagation delay. It also violates the
complementary outputs. Fortunately, it is just a
short transient value.


Given the inputs S=0 and R=0, the output should
not change, i.e., the latch keeps its information.
To see why, recall that in Boolean algebra
0 + x = x. Since both S and R are zero, the output of
each NOR gate is determined by the inverse of either
Q or Q’. Let’s examine the upper NOR gate:
(Q + S)’ = (Q + 0)’ = Q’, which is the output of the
upper NOR gate. The relation of inputs and output
of the lower NOR gate is
(Q’ + R)’ = (Q’ + 0)’ = (Q’)’ = Q. It is concluded that
the outputs Q and Q’ will not change when both
inputs S and R are zero. In other words, this
combination of inputs to the SR NOR latch will allow
it to keep its current state.


The most interesting combination of inputs is when
both S and R are one. Once again, there are two
cases to be analyzed: Q=0 and Q=1. When Q=0,
the aggregated signals change from
1101→1100→1100. The outputs stabilize with
both Q and Q’ at zero. This violates our assumption
that they have to be complements of each other. On
the other hand, when Q=1, the aggregated signals
change from 1110→1100. Again, the output violates
the assumption. Therefore, this combination of
inputs is called the restricted combination, which
results in a forbidden state. This restricted
combination may be converted to one of the three
non-restricted combinations by adding preceding
logic circuitry. For example, if the restricted
combination is converted to keep the latch state, the
preceding logic will be two NOT gates and two AND
gates such that S'=SR’ and R'=S’R, where S' and R'
are the new inputs to the latch.
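
The step-by-step signal changes discussed above can be reproduced with a small simulation of the two cross-coupled NOR gates. This is a simplified sketch: it assumes both gates switch with the same delay and re-evaluates them together until nothing changes; the function names are made up for this illustration.

```python
def nor(a: int, b: int) -> int:
    return 1 - (a | b)

def settle(s, r, q, qn, max_steps=10):
    """Re-evaluate both NOR gates until the outputs stop changing.

    Returns the list of SRQQ' values visited, written as 4-digit strings.
    Both gates are assumed to switch with the same delay (a simplification).
    """
    trace = [f"{s}{r}{q}{qn}"]
    for _ in range(max_steps):
        new_q, new_qn = nor(r, qn), nor(s, q)   # Q = (R+Q')', Q' = (S+Q)'
        if (new_q, new_qn) == (q, qn):
            break
        q, qn = new_q, new_qn
        trace.append(f"{s}{r}{q}{qn}")
    return trace

print(settle(1, 0, 0, 1))  # set:    ['1001', '1000', '1010']
print(settle(0, 1, 1, 0))  # reset:  ['0110', '0100', '0101']
print(settle(0, 0, 1, 0))  # hold:   ['0010']
print(settle(1, 1, 1, 0))  # S=R=1:  ['1110', '1100']
```

The transient values in the middle of each trace depend on the equal-delay assumption; real gates with unequal delays may pass through different intermediate values before reaching the same stable outputs.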


Like the truth table that describes the relation of
inputs and outputs of combinational logic, a state
transition table or characteristic table for the SR
NOR latch is shown in the following table. The state
transition table is essentially a truth table that
includes the inputs (some of which are fed-back
outputs) and the output of a sequential logic.
The next state of Q is denoted by Q'. Note that the
inputs of the latch in the truth table could include
both Q and Q’, but the table would then contain 8
rows (not 16, why?), which is tedious as some of the
cases are duplicated. Therefore, by using Q and its
next state Q', the truth table is succinct.


Table The State Transition Table of the SR NOR
Latch


The SR NAND Latch


In practice, NAND gates are cheaper than NOR gates
in most semiconductor devices. Therefore, a latch
equivalent to the SR NOR latch may be built
from NAND gates. The following figure illustrates
the SR NAND latch.


Figure 5 The SR NAND Latch


The operation of the SR NAND latch is a little bit
different from that of the SR NOR latch. When both
S and R are 1, the SR NAND latch will keep its state.
Moreover, the restricted input combination for the
SR NAND latch is when both inputs are zero because
that will lead to both Q and Q’ being 1, which
violates the complement assumption. If the inputs
are negated and the outputs are swapped, the SR
NAND latch operates identically to the SR NOR
latch. The truth table of the SR NAND latch is shown
in the following table.


Table The Truth Table of the SR NAND Latch


The Gated SR NAND Latch


The inputs to the SR NAND latch may be controlled
by AND gates in a way that only allows the latch to
change its state in a controlled manner. This is a
typical practice in digital design where an enable
signal is designated. If the enable is asserted, the
latch works as normal. Otherwise, it will keep its
state. The following figure illustrates a gated SR
NAND latch.


Figure 6 A Gated SR NAND Latch


In the gated SR NAND latch, the first stage of the
circuit is implemented by two NAND gates. The rest
of the circuit is the same as the original SR NAND
latch. When EN is 0, the inputs to the SR NAND
latch are both 1’s, which will keep the state of the
latch as described in the SR NAND latch section.
When EN is 1, the inputs to the SR NAND latch are
controlled by S and R, as in the original SR NAND
latch. Therefore, the gated SR NAND latch acts as if
there were no gated control stage when EN is 1.
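
The behavior of the SR NAND latch and of its gated variant can be explored with the same kind of simulation. The wiring below is an assumption of this sketch, since the figures are not reproduced here: the latch is taken as Q = (S·Q’)’ and Q’ = (R·Q)’, and the first stage as two NAND gates that combine S and R with EN.

```python
def nand(a: int, b: int) -> int:
    return 1 - (a & b)

def sr_nand(s, r, q, qn, max_steps=10):
    """Settle a cross-coupled NAND pair: Q = (S·Q')', Q' = (R·Q)'."""
    for _ in range(max_steps):
        new_q, new_qn = nand(s, qn), nand(r, q)
        if (new_q, new_qn) == (q, qn):
            break
        q, qn = new_q, new_qn
    return q, qn

def gated_sr_nand(s, r, en, q, qn):
    """Assumed first stage: two NAND gates combine S and R with EN."""
    return sr_nand(nand(s, en), nand(r, en), q, qn)

print(sr_nand(1, 1, 0, 1))           # both inputs 1: state kept -> (0, 1)
print(sr_nand(0, 0, 0, 1))           # both inputs 0: restricted -> (1, 1)
print(gated_sr_nand(1, 0, 0, 0, 1))  # EN = 0: state kept        -> (0, 1)
print(gated_sr_nand(1, 0, 1, 0, 1))  # EN = 1: state changes     -> (1, 0)
```

With EN at 0, both first-stage outputs are 1 regardless of S and R, which is exactly the input combination that makes the SR NAND latch keep its state.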


The D-Type SR NAND Latch 


The EN signal may be treated as a clock. When it is
low, nothing changes, whereas when it is high,
the latch changes its state according to the inputs.
However, the gated SR NAND latch does not
exclude the restricted inputs, i.e., both inputs
being 0’s. The easiest way to exclude the restricted
inputs is to force the inputs to be either 1 and 0, or
0 and 1. This in effect leaves just one input, as
shown in the following figure.


Figure 7 A D-Type SR NAND Latch


We could add a NOT gate to restrict the inputs on 
top of the gated control. However, the negated 
signal can be obtained from the bottom NAND gate 
that receives data input D. Thus, the NOT gate is 
saved. The rest of the circuit is similar to the gated 
SR NAND latch. If the EN signal is required, an AND 
gate may be added to control the clock, which 
allows the clock to enter the circuit when the EN is 
1, and blocks the clock when the EN is 0. 


The SR NOR Latch as a Finite State 
Machine 


Because of its nature, the SR NOR latch can be
described by a finite state machine. To model the SR
NOR latch as a finite state machine, we first
find out what states it has. It may be hard to
identify all the states at first. A rule of thumb
is to list all variables except the inputs; each
combination of their possible values is a state. In the
SR NOR latch, there are three variables: S, R, and Q.
Q’ does not count because its value is determined by
Q. Since S and R are inputs, the only variable left is
Q. The possible values of Q are 0 and 1. Therefore,
there are two states in total: Q=0 (denoted by state
0) and Q=1 (denoted by state 1).


The next step is to find the state transition
function. Excluding the restricted inputs, the
inputs SR can be 00, 01, or 10. When SR=00, the
state does not change, i.e., there is a transition
from each state to itself. When SR=01, the next
state is state 0 because of the reset operation.
This leads to two transitions: from state 0 to state
0 (already reset), and from state 1 to state 0
(reset). When SR=10, the next state is state 1
because of the set operation. Similarly, this leads
to two transitions: from state 0 to state 1 (set),
and from state 1 to state 1 (already set). Figure 8
shows the state transition diagram with the
aggregated signals of S and R as inputs.
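The six transitions listed above can be written down as a lookup table. The following Python sketch is only an illustration; the names SR_LATCH_TRANSITIONS and run_sr_latch are ours.

```python
# States are the values of Q; inputs are the aggregated signals "SR".
SR_LATCH_TRANSITIONS = {
    (0, "00"): 0, (1, "00"): 1,   # hold: each state moves to itself
    (0, "01"): 0, (1, "01"): 0,   # reset: next state is 0
    (0, "10"): 1, (1, "10"): 1,   # set: next state is 1
}


def run_sr_latch(state, inputs):
    """Apply a sequence of SR inputs and return the final state."""
    for sr in inputs:
        if sr == "11":
            raise ValueError("SR=11 is the restricted input")
        state = SR_LATCH_TRANSITIONS[(state, sr)]
    return state
```

For example, starting from state 0, the input sequence 10, 00 sets the latch and then holds it, so the final state is 1.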


Figure 8 The State Transition Diagram for the SR NOR Latch 




We do not call the machine in Figure 8 a finite
automaton simply because it defines neither final
states nor an initial state. However, the state
transition diagram does describe the behavior of the
SR NOR latch, and it provides a visual
representation of the SR NOR latch as a finite state
machine. A state transition diagram may indicate
outputs on either transitions or states. Typically,
a slash is used to denote an output. For example, an
output associated with a state would be written as
s0/0, i.e., the system outputs 0 in state s0. An
output associated with a transition would be written
as s0 -10/1-> s1, i.e., on input 10, the system
changes state from s0 to s1 and outputs 1. Figure 9
illustrates the state transitions of the SR NOR
latch with annotated outputs associated with
transitions.


Figure 9 The State Transition Diagram for the SR NOR Latch with Annotated Outputs


Acceptors/Recognizers 


If finite automata are defined to generate an
output, say “yes”, when they move from the initial
state to a final state, they are called acceptors or
recognizers. That is, they accept or recognize
something. Here we elaborate the earlier string
matching example into a recognizer. The problem
statement is: given an input string of length 3,
design a finite automaton that recognizes the word
“ABC.” In the state machine of the string matching
example, what if the incoming character is not “B”?
The state machine will stay at state s1 until the
input is “B.” This means that any letters between
“A” and “B” are skipped. As a result, an input that
starts with “A”, followed by any letters other than
“B”, and then followed by “BC” will still be
accepted. Therefore, the state machine has to be
redesigned to handle the skipped-letter error. The
following figure shows the finite automaton for the
acceptor that recognizes the word “ABC.” In this
design, a failure state (s4) and three more
transitions are added to handle the skipped-letter
case.


Figure An Acceptor/Recognizer for the String 
Matcher that Recognizes "ABC" 


Formally, each component of a finite automaton has
to be specified. In the above example, the finite
automaton is defined as follows:


A = (Q, Σ, δ, q0, F) where

Q = {s0, s1, s2, s3, s4}, Σ = letters, q0 = s0,
F = {s3, s4}, and δ is the set of transitions
defined in the figure above. We can simply associate
a “yes” answer with state s3, and “no” with all
other states; in that case, a one-bit output is
enough. Note, however, that the acceptor outputs
“yes” at state s3 and “no” at state s4, but has no
output for any other state. Therefore, the output
should be at least two bits wide to cover the case
in which the acceptor is still in “computation.”
However, employing a two-bit output turns the state
machine into a transducer, which will be discussed
later.
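A minimal Python sketch of this acceptor follows. The state names come from the text; the function names delta and recognize are ours, and the rule that any input arriving after s3 or s4 leads to the failure state is an assumption, since the problem only considers strings of length 3.

```python
# Expected letter and successor for each state still in "computation".
EXPECTED = {"s0": ("A", "s1"), "s1": ("B", "s2"), "s2": ("C", "s3")}


def delta(state, ch):
    """Transition function: any unexpected letter leads to the failure state s4."""
    if state in EXPECTED:
        wanted, nxt = EXPECTED[state]
        return nxt if ch == wanted else "s4"
    return "s4"  # assumption: input after s3 or s4 ends in failure


def recognize(text):
    """Return "yes" if the automaton ends in s3, and "no" otherwise."""
    state = "s0"
    for ch in text:
        state = delta(state, ch)
    return "yes" if state == "s3" else "no"
```

With the failure state in place, an input such as "AXBC" is rejected instead of being accepted by skipping the extra letter.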


Moore and Mealy Machines 


Finite state machines that produce outputs (other 
than just 0 or 1 in an acceptor) based on current 
inputs and/or states are called transducers. These 
outputs are used to control other components in a 
system. For example, a control unit in a pipelined 
CPU belongs to this category. Depending on what
the outputs are based on, there are two types of
transducers: Moore machines and Mealy machines.
The outputs of a Moore machine are based on the 
states whereas those of a Mealy machine are based 
on both the inputs and the states. 


In a Moore machine, a state decoder is used along
with other combinational logic to generate outputs
accordingly. When implemented in hardware, a
state register is used to keep track of the current
state in a system. A finite state machine with n
states needs a register of ⌈lg n⌉ bits, where lg
denotes the logarithm to base 2, and the ceiling
function (⌈ ⌉) maps a real number to the smallest
integer not less than it. For example, a finite
state machine of 8 states needs a 3-bit register
because lg 8 = 3. In the “ABC” acceptor example, if
we want to differentiate the states in computation
(neither s3 nor s4), there has to be at least a
two-bit output. This makes it a Moore machine. The
truth table of a state decoder used to generate the
two-bit output is shown in the table below. Since
there are 5 states, the state register has 3 bits.


Table The Truth Table of the State Decoder of a 
Moore Machine for the String Recognizer that 
Accepts "ABC" 


The 3-bit state register is denoted as R2R1R0, and
the 2-bit output is represented as C1C0. The output
is defined as 00 for computation, 01 for failure, and
10 for success in recognizing the string “ABC.” Based
on the truth table, the combinational logic for the
state decoder is given by the two Boolean
expressions C1 = R2’R1R0 and C0 = R2R1’R0’, where
the prime denotes negation. It can be seen that the
outputs of the Moore machine depend on the state
register only. From the perspective of the state
transition diagram, the Moore machine is modeled by
the state transition diagram with outputs associated
with the states.
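The register width and the state decoder can be sketched in Python as follows. The sketch assumes that the states s0 to s4 are encoded in binary order (s0 = 000, ..., s3 = 011, s4 = 100), so that success is decoded as C1 = R2’R1R0 and failure as C0 = R2R1’R0’; the function names are ours.

```python
def register_bits(n_states):
    """Width of the state register: the ceiling of lg(n_states), at least 1."""
    return max(1, (n_states - 1).bit_length())


def decode(r2, r1, r0):
    """Moore state decoder: map the state register R2R1R0 to the output C1C0."""
    c1 = (1 - r2) & r1 & r0          # 1 only in s3 = 011 (success)
    c0 = r2 & (1 - r1) & (1 - r0)    # 1 only in s4 = 100 (failure)
    return c1, c0
```

The decoder reads only the state register, which is what makes the outputs Moore type. Here register_bits(5) is 3 for the five-state recognizer, and register_bits(8) is also 3.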


If the outputs of a finite state machine depend not
only on the current state, but also on the current
inputs, the finite state machine is called a Mealy
machine. In its state transition diagram, a Mealy
machine has its outputs associated with its
transitions. For example, consider the following
problem: given an input string of any length,
transcribe it to another string that records a 1 at
each occurrence of the pattern “101.” The string
“10101” is thus transcribed to “00101” as there are
two occurrences of “101.” In other words, once the
pattern is found, the finite state machine does not
reset to the initial state, but keeps searching for
the next occurrence. The figure below illustrates
the Mealy machine for the pattern transcriber. The
leftmost 1 in the input 10101 moves the Mealy
machine from s0 to s1, and outputs 0. In state s1,
the next 0 causes the transition from s1 to s2, and
outputs 0. The next 1 moves the state from s2 to s1,
and outputs 1, because an occurrence of “101” is
found. The next 0 moves the state from s1 to s2, and
outputs 0. Finally, the last 1 moves the state from
s2 to s1, and outputs 1. Therefore, the collected
output is 00101. The output of this Mealy machine
depends on both the states and the inputs. With the
input 1, for example, the output depends on the
current state of the Mealy machine. If the current
state is s2, the output is 1; otherwise, the output
is 0.
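The walk-through above can be reproduced with a small Python sketch. The three transitions used in the walk-through come from the text; the remaining ones (s0 on 0, s1 on 1, and s2 on 0) are our assumptions about the lost figure, chosen so that overlapping occurrences of the pattern are still found.

```python
# (state, input) -> (next state, output); outputs belong to transitions.
MEALY = {
    ("s0", "0"): ("s0", "0"),   # assumption: keep waiting for the first 1
    ("s0", "1"): ("s1", "0"),
    ("s1", "0"): ("s2", "0"),
    ("s1", "1"): ("s1", "0"),   # assumption: a repeated 1 can still start "101"
    ("s2", "0"): ("s0", "0"),   # assumption: "100" breaks the pattern
    ("s2", "1"): ("s1", "1"),   # "101" found; stay ready for an overlap
}


def transcribe(bits):
    """Return the output string of the Mealy machine for a string of 0s and 1s."""
    state, out = "s0", []
    for b in bits:
        state, o = MEALY[(state, b)]
        out.append(o)
    return "".join(out)
```

Calling transcribe("10101") returns "00101", matching the trace above.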


Figure An Example of a Mealy Machine that 
Transcribes 101 to 001 


A typical implementation of the state transition
logic in a Mealy machine includes the output logic,
because the outputs are a function of the inputs and
the state register. Sometimes, this implementation
results in fewer logic gates. However, the outputs
may be perturbed by the inputs. It is typically
useful to keep the outputs in registers to stabilize
the circuit.


Algorithmic State Machines 


Finite state machines are a powerful way to model a
system in terms of states and their transitions.
However, the timing and the algorithm for the
outputs are not expressed explicitly. One method for
designing finite state machines with detailed
algorithms and timing information is the algorithmic
state machine (ASM). An ASM chart is used to
informally describe the sequential operations of a
digital system. Like a flowchart, an ASM chart is
composed of three types of basic elements: the state
box, the decision box, and the conditional output
box. A state box is drawn as a rectangle, a decision
box as a diamond, and a conditional output box as an
oval. A state box in an ASM chart corresponds to one
state in a regular state transition diagram or
finite state machine, and lists the Moore type
outputs of that state. The name of a state box is
placed at the top left corner outside the box.
Figure 12 shows a state box in an ASM chart.


Figure 12 A State Box in an ASM Chart


A decision box, drawn as a diamond in an ASM chart,
is used to conditionally transfer between two
states, or between a state and a conditional output.
The decision box has one entry path and two exit
paths (true and false). Within the decision box, the
condition is expressed by a Boolean expression that
contains inputs of the finite state machine. The
figure below illustrates a decision box in an ASM
chart.


Figure A Decision Box in an ASM Chart


An oval indicates a conditional output box, which
describes Mealy type outputs in an ASM chart.
These Mealy type outputs depend not only on the
state, but also on the inputs of a finite state
machine. Figure 14 shows a conditional output box in
an ASM chart.


Figure 14 A Conditional Box in an ASM Chart


Timing in an ASM chart is implicitly expressed by
the state box and its associated combinational logic.
On each clock (normally the rising edge), a state
transition is taken, i.e., from one state box to
another (or the same) state box in the ASM chart.
The period of the clock has to be long enough to
accommodate the propagation delay of the
combinational logic associated with any state box.
Otherwise, the circuit may not operate correctly
because of timing violations. The figure below gives
an example of a simple ASM chart, where “:=” denotes
the assignment operator, which is different from the
comparison operator (“=”).


Figure An Example of an ASM Chart


In this ASM chart, there are two state boxes and one
decision box. The names of the state boxes are s0
and s1, respectively. Inside the state boxes, the
variable Do (D-Out) is the only Moore output, and it
is normally implemented as a register. The value of
Do depends on the current state: it is 0 in s0, and
1 in s1. In the decision box, the input Di is
checked. Based on the value of the input Di, the
machine moves from s1 to s0 when Di=0, or stays at
s1 when Di=1. On each clock, a state transition is
triggered. Therefore, the sequence of state
transitions when the input Di=1 is s0 → s1 → s1 →
..., and the finite state machine stays at s1
forever. On the other hand, the sequence of state
transitions when the input Di=0 is s0 → s1 → s0 →
s1 → s0 → ..., and the finite state machine
oscillates between s0 and s1. This ASM chart
actually specifies a simple digital circuit that can
be used to design a flashing light. The flashing
frequency is equal to half of the clock frequency.
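The two state sequences can be reproduced with a clock-by-clock simulation. The Python sketch below is only an illustration of the chart as described in the text; the function name simulate is ours, and each element of the input list is the value of Di sampled on one clock.

```python
def simulate(di_values):
    """Return the (state, Do) pair observed on each clock, starting in s0."""
    state = "s0"
    trace = []
    for di in di_values:
        do = 0 if state == "s0" else 1      # Moore output: depends on state only
        trace.append((state, do))
        if state == "s0":
            state = "s1"                    # s0 always moves to s1
        else:
            state = "s1" if di == 1 else "s0"
    return trace
```

With Di held at 0, the output Do alternates 0, 1, 0, 1, ..., so one full flash takes two clocks, which is why the flashing frequency is half of the clock frequency.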


The ASM chart shown above may be directly
implemented in hardware using VHDL, as listed in the
table below. The SimpleFSM has two extra inputs
(clock and reset) in addition to Di. The clock is
used to trigger state transitions, and the reset is
for circuit initialization. The state transition
logic and output decoding are implemented in a
sequential VHDL process. The user-defined
enumeration type is defined for the implementation
of the state register (one bit in this case). A case
statement is used to implement the state transition
logic. Note that the output Do is placed in a
register and synchronized with the rising edge of
the clock. This example shows a typical FSM
implementation in VHDL.


Table The VHDL Code for a Simple FSM 


library IEEE;
use IEEE.STD_LOGIC_1164.ALL;

entity SimpleFSM is
    Port ( Di, clock, reset : in STD_LOGIC;
           Do : out STD_LOGIC);
end SimpleFSM;

architecture Behavioral of SimpleFSM is
    -- user defined enumeration type
    type state_type is (s0, s1);
    signal c_state: state_type;
begin
    asm: process(reset, clock, Di)
    begin
        if reset='1' then
            -- asynchronous reset to the initial state
            c_state <= s0;
        elsif clock='1' and clock'event then
            -- rising edge: take one state transition
            case c_state is
                when s0 =>
                    c_state <= s1;
                    Do <= '1';
                when s1 =>
                    if Di='1' then
                        c_state <= s1;
                        Do <= '1';
                    else
                        c_state <= s0;
                        Do <= '0';
                    end if;
            end case;
        end if;
    end process;
end Behavioral;


The simulation of the SimpleFSM is illustrated in
the figure below. The waveform confirms that the
output Do oscillates between its two values when the
input Di is 0 (before 50 ns). The output Do stays
high after the first rising edge of the clock after
50 ns, when Di is high.


Figure The Simulation Waveform for the 
SimpleFSM 


Let’s now consider a slightly more complex example:
a traffic light at a crossroad, with its ASM chart
depicted in the figure below. State names are
encoded by two letters from RGY, indicating red,
green, or yellow lights. The first letter is for the
vertical road and the second is for the horizontal
road. For example, the RG state represents a red
light on the vertical road and a green light on the
horizontal road. First, the timer input is set to
60, and the system enters the RG state, in which the
timer is updated. Second, the value of the timer is
checked. If it has reached zero, the system moves to
the RY state with the timer set to 15. Otherwise,
the system remains in the RG state with the timer
counting down by 1. Once in the RY state, the system
stays there for 15 clocks, and then moves to the GR
state.


[ASM chart: state boxes RG, RY, and GR, each
containing Timer_o := Timer_i; conditional output
boxes Timer_i := 60, Timer_i := 15, and
Timer_i := Timer_o-1]

Figure The Partial ASM Chart for a Traffic Light 


There are two variables used in the ASM chart:
Timer_i and Timer_o, which are the input and the
output of a countdown timer. The value of the timer
is updated from the input Timer_i on each clock
cycle. This input, however, does not come from
outside the finite state machine. Timer_o, which
appears in the states RG, RY, and GR, is considered
a Moore type output. Timer_i, however, does not
depend on the current state only. In the state RG,
for example, Timer_i could be either 60 or some
value between 0 and 59, depending on whether it
comes from the top conditional output box or the
left conditional output box. Therefore, Timer_i is
considered a conditional output, i.e., a Mealy type
output.
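The countdown behavior can be sketched in Python as below. This is a rough model under stated assumptions, not the chart itself: the reload value of 60 for GR and the rule that GR simply holds once its timer expires are ours, because the partial chart does not show what follows GR. The exact number of clocks spent in each state also depends on register timing that the partial chart does not pin down; in this sketch, the clock on which zero is detected counts as part of the state.

```python
# Reload values: 60 and 15 come from the chart; 60 for GR is an assumption.
RELOAD = {"RG": 60, "RY": 15, "GR": 60}
# Only the transitions described in the text; what follows GR is not shown.
NEXT_STATE = {"RG": "RY", "RY": "GR"}


def clock(state, timer_o):
    """Apply one clock and return the new (state, Timer_o) pair."""
    if timer_o == 0 and state in NEXT_STATE:
        nxt = NEXT_STATE[state]
        return nxt, RELOAD[nxt]        # Timer_i := reload value
    if timer_o > 0:
        return state, timer_o - 1      # Timer_i := Timer_o - 1
    return state, timer_o              # GR expired: rest of the chart unknown


def run(clocks):
    """Return the states visited, starting in RG with Timer_o = 60."""
    state, timer_o = "RG", RELOAD["RG"]
    visited = [state]
    for _ in range(clocks):
        state, timer_o = clock(state, timer_o)
        if state != visited[-1]:
            visited.append(state)
    return visited
```

The value loaded into the timer depends on which branch of the decision is taken, not only on the current state, which is the sense in which Timer_i is a Mealy type output.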


Summary 


Finite state machines are perhaps the most
important computation model, as they completely
capture the behavior of a digital system. They have
been used in multiple fields of computer science,
such as computer architecture, operating systems,
compilers, languages, and software engineering. In
computer architecture in particular, finite state
machines are used to design the sequential
operations of a hardware component.


The formal definition of a finite state machine as a
finite automaton includes a set of states, an
initial state, an alphabet, a transition function,
and a set of final states. There are two types of
finite automata: deterministic finite automata (DFA)
and non-deterministic finite automata (NFA). On an
input, a DFA can only move to one specific state,
whereas an NFA may move to a set of states. This
non-deterministic characteristic simplifies the
modeling of a complex system. The NFA and the DFA
are theoretically equivalent.


Finite state machines can be directly implemented
with sequential logic (states) and combinational
logic (state transition logic). Two typical
implementations are the Moore machine and the Mealy
machine. The Moore machine generates outputs purely
based on the current state, whereas the Mealy
machine produces outputs based on both inputs and
states. In a state transition diagram, the Moore
type outputs are associated with the states, and the
Mealy type outputs are associated with transitions.


The algorithmic state machine is a method that
addresses a deficiency of finite state machines by
capturing timing information and detailed output
generation logic. The ASM chart is similar to a
flowchart, and describes the finite state machine on
a clock-by-clock basis. The Moore type and Mealy
type outputs are denoted within a state box and a
conditional output box, respectively. Variables in
state boxes are normally implemented in registers,
as their values change according to states, whose
transitions are triggered by clocks.


Exercise 


1. Given an input string of any length, design a
finite state machine that outputs a binary string
of the same length, where the ones indicate
matches of the word “SPSU.” For example, if the
input is “AbCSPSU1234SPSU”, the output will
be “000000100000001,” where the ones are
lined up with the letter “U.”

2. Implement the FSM for the string matcher
described in problem 1 in VHDL, and simulate
the circuit.


Introduction to Computer Programming 

This chapter gives a high-level overview of
computer programming. A variety of assembly
programs are introduced. Assembly programming is an
efficient way to understand how hardware works.


Introduction to Computer Programs 


A computer program is a set of instructions that
solves a problem. Programs are written at four
levels: 1) high-level programs, written in languages
such as Java, Basic, C#, or Perl; 2) mid-level
programs, written in languages such as C; 3)
low-level programs, written in assembly languages;
and 4) machine-level programs, i.e., binary machine
code. The higher the level of a program, the easier
it is for human beings to write. Only on rare
occasions do we write a program in binary machine
code manually. The process of translating programs
from a high level to a low level is called
compilation, and it is typically done by a compiler.


Programming Languages 


To make machine code manageable, each instruction is
associated with a mnemonic name, e.g., ADD is
associated with the addition instruction. An
assembly language program is written using the
mnemonics and some labels that represent addresses.
These mnemonics are sometimes called symbolically
coded instructions. They are basically machine
instructions referenced by mnemonics that are easy
for human beings to remember. Since an assembly
language is bound to its underlying hardware, it
depends on a specific architecture. Thus, each
processor has its own assembly language, and a
program written in one assembly language is hard to
port to another processor. Nevertheless, assembly
language, though tedious, gives full control of the
hardware. Most system software has its critical
components written in assembly for better
performance.


Above assembly is the C language, which is typically
classified as a mid-level language. A mid-level
language provides some low-level hardware control
along with high-level abstraction for solving
problems. In C, pointer arithmetic gives programmers
a very flexible way to manipulate memory. Yet, its
high-level characteristics allow programmers to
implement sophisticated systems. As an example, most
contemporary operating systems are written in C.


High-level languages are designed to provide the
highest abstraction for solving problems. Since the
main purpose of high-level languages is problem
solving, some of the hardware details are hidden
from the programmers. A programmer need not be aware
of the hardware features when programming. In this
way, programmers may focus on how to solve the
problem, instead of on how to write a program to
solve it. High-level languages include Java, Basic,
Fortran, C++, and scripting languages such as Perl,
Bash, Python, and the like.


Most programming languages share some
characteristics, such as syntax and semantics.


Compilation Versus Assembly 


Computer hardware can only understand and execute
binary machine code. Programs written in anything
other than machine code have to be converted to the
machine code of a machine. Normally, a compiler is
used to translate a program, written in a high-level
language, to its equivalent assembly code. The
process of translating a program written in a
high-level language to an assembly program is called
compilation. An assembler is then used to convert
the assembly program to machine code for execution.
Compilation involves a series of processing steps
including lexical analysis, syntax analysis, error
recovery, scope analysis, optimization, and code
generation. Compilation can be very tedious and
difficult, depending on the complexity of a
language. Assemblers typically employ a table lookup
method to replace assembly mnemonics with actual
machine code. Therefore, the design of an assembler
is less complex than that of a compiler. Table 1
shows the relation of compilation and assembly.


Table 1 The Relation of Compilation and Assembly


By and large, programs written in assembly language
have higher performance than those written in
high-level languages. However, assembly programming
is tedious and error-prone. It would be very
unproductive if a software project were written
fully in assembly. A program typically follows the
20/80 rule, i.e., 80% of the program execution time
is spent in 20% of the code. Based on the 20/80
rule, though the whole program may not be written
in assembly, it would be beneficial if the critical
20% of the code were written in assembly. This leads
to a hybrid project, i.e., the system is developed
in both assembly and high-level languages. For
example, suppose a project written in an assembly
language requires 100 programmer-months, whereas it
requires only 20 programmer-months using a
high-level language. However, the program written in
an assembly language runs in 20 seconds, but takes
200 seconds if developed in a high-level language.
If the program follows the 20/80 rule, the project
could be developed in a hybrid way, in which the
critical portion (20%) is developed in an assembly
language, and the rest is developed in a high-level
language. Table 1 shows the scenario.


Table 1 A Comparison of a Project Written in 
Assembly Language, High-Level Language, or Both 


Project Type            Programmer-Months       Execution Time
                                                (Seconds)
Fully in Assembly       100                     20
Fully in High-Level     20                      200
Language
Hybrid                  36 (20 assembly and     56
                        16 high-level)


With the extra 16 programmer-months on top of the
pure high-level programming project, the execution
time improves from 200 seconds to 56 seconds.
Overall, the hybrid approach is a cost-effective
way to implement the project.
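The hybrid figures in the table follow from a weighted sum, which the short Python sketch below reproduces. The function name hybrid is ours; the model assumes that effort scales with the share of code, and that execution time scales with the share of time spent in the critical code.

```python
def hybrid(asm_value, hll_value, asm_share_percent):
    """Blend an all-assembly figure and an all-high-level figure by percentage."""
    return (asm_share_percent * asm_value
            + (100 - asm_share_percent) * hll_value) / 100


# Effort: 20% of the code in assembly, 80% in a high-level language.
months = hybrid(100, 20, 20)     # 20 + 16 programmer-months
# Time: 80% of the run time is in the critical code, now written in assembly.
seconds = hybrid(20, 200, 80)    # 16 + 40 seconds
```

Put differently, the critical code accounts for 160 of the 200 seconds in the high-level version; rewriting it in assembly makes that part ten times faster (16 seconds), while the remaining 40 seconds are unchanged.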


The 20/80 rule is a statistical observation. In
reality, there are occasions when assembly is the
best choice. If the speed or the size of a program
is critical, it is hard to meet this design
constraint using a high-level language. Compilers
may be developed for a general architecture and
neglect subtle device-specific features. Assembly
programming provides full control of the instruction
set architecture, and takes advantage of each unique
feature of a processor. If a system has to use those
special features, assembly programming is
unavoidable. However, assembly programming is very
close to machine code, and it typically lacks
programming structures. An assembly programmer has
to code in a consistent way to organize the code.
Otherwise, as the code grows, it becomes harder to
debug. Since assembly programming is machine
specific, i.e., the instructions are bound to the
underlying machine architecture, the code may not be
easily portable.


Compilation 


A compiler is basically a text translator. Its input
is a text (a set of strings), and its output is
another text. In this sense, an assembler is a
compiler because it converts an assembly program to
its machine code (program). However, we normally
call it an assembler because little compilation is
involved in translating an assembly program to
machine code. An assembler actually “assembles”
instructions according to mnemonics. Assembly is far
simpler than a full cycle of compilation.


In a system, when a program source changes, the
source code has to be recompiled. Recompilation
takes some time for a huge monolithic program
source. Therefore, a large software project is
sometimes organized into a number of smaller pieces.
If some of the pieces are modified, the compiler
only needs to recompile the modified pieces. For
example, the “Make” utility in Unix systems keeps
track of which pieces in a project have been
modified and recompiles them when necessary. In a C
project, source code is organized into headers and
source files. Each source file may include headers.
Each source file is compiled to an object file. The
Make utility uses rules that set up dependencies and
actions to build a target, like the following:


Target: Dependency
[TAB]Command
[TAB]Command


The target of the Make rule indicates the output
file generated by the series of commands. The
related headers or other objects that have to be
built before this target are listed in the
dependency. Each command in the Make rule must begin
with a tab character and must be on its own line.
For example, suppose main.c includes global.h. The
following Make rule recompiles the source main.c to
main.o if either global.h or main.c has been
modified.


main.o: global.h main.c
[TAB]gcc -c main.c


Note that the Make utility checks the time stamps of
main.o, main.c, and global.h. If the time stamp of
main.c or global.h is newer than that of main.o, the
command of the Make rule is executed. The reasoning
is that the time stamp of main.o should be newer
than those of the other two files if they have not
been modified since the last compilation. The Make
utility is a fundamental software management tool,
and many derivatives have been developed, such as
GNU’s automake and autoconf tool chain. These tools
further simplify the creation and maintenance of
Make rules.
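The time stamp comparison can be sketched in a few lines of Python. This is an illustration of the idea only, not how Make itself is implemented; the function name needs_rebuild is ours.

```python
import os


def needs_rebuild(target, dependencies):
    """Return True if the target is missing or older than any dependency."""
    if not os.path.exists(target):
        return True
    target_time = os.path.getmtime(target)
    return any(os.path.getmtime(dep) > target_time for dep in dependencies)
```

For the rule above, needs_rebuild("main.o", ["global.h", "main.c"]) is true exactly when main.o does not exist yet or when one of the two files has been modified since main.o was last built.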


A compiler compiles program source into the machine
code of a specific machine. The machine code can
only be executed on that specific machine. If the
same program is to be executed on another machine,
another compiler is required to recompile the whole
program. Program source that can be compiled to the
machine code of other machines is portable. Since
portable code may not use machine-dependent
instructions, its performance is normally not quite
as good as that of customized code. Compilation may
not be easy on a given machine because it involves
a variety of factors such as compiler versions,
libraries, tools, and the like.


Interpretation 


Are there programs that may be executed directly?
The answer is yes. Programs written in the original
Basic language may be executed by a Basic
interpreter. An interpreter is an execution engine
(an executable program) that reads the program
source line by line and executes it. Java, as
another example, is an interpreted language in which
Java byte code is interpreted by a Java virtual
machine. A Java program is first compiled to Java
byte code (machine code ready to be executed in a
Java virtual machine). The Java byte code is then
executed by a Java virtual machine. Scripting
languages are interpreted as well; Perl scripts, for
example, are interpreted by a Perl interpreter. If
we push down to the low-level languages, hardware is
an interpreter for machine code. There may not be a
clear line separating interpreted languages from
compiled ones, because a compiler may be developed
for an interpreted language. For example, there are
Basic compilers that translate Basic programs to
executables, which no longer require a Basic
interpreter for their execution. Thus, we may
roughly say that a language is an interpreted
language if programs written in that language may be
executed directly by an interpreter without
compilation. Java, however, is somewhere in between,
because Java source has to be compiled to byte code,
and the Java virtual machine executes the byte code,
not the Java source.
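To make the idea of line-by-line interpretation concrete, here is a toy interpreter in Python. The three-statement language (LET, ADD, PRINT) is hypothetical and exists only for this sketch; it is not Basic or any real language.

```python
def interpret(source):
    """Read the source line by line, execute each line, and collect the output."""
    variables, output = {}, []
    for line in source.splitlines():
        words = line.split()
        if not words:
            continue                                 # skip blank lines
        if words[0] == "LET":                        # LET name = integer
            variables[words[1]] = int(words[3])
        elif words[0] == "ADD":                      # ADD name integer
            variables[words[1]] += int(words[2])
        elif words[0] == "PRINT":                    # PRINT name
            output.append(variables[words[1]])
        else:
            raise SyntaxError("unknown statement: " + line)
    return output
```

No translation step produces a separate executable: each line is decoded and carried out immediately, which is also why an error on a later line is only discovered when execution reaches it.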


One of the major advantages of interpreted languages
is that the program source can be interpreted
(executed) on any machine that has a running
interpreter. The lack of compilation makes the
program source portable. Here the concept of
portability is a little different from that in
compilation. A program written in an interpreted
language may be interpreted on any machine without
any effort. However, a program written in a compiled
language needs to be compiled for each machine, and
the compilation requires some effort. For an
interpreted language, this effort is absorbed by the
design and implementation of the interpreter. Thus,
Java virtual machines come in a variety of versions
(for Windows, Linux, etc.), whereas Java programs
themselves are portable. Systems developed in Java
can be ported to other platforms with little effort.


Assembler 


An assembler converts mnemonic (symbolic)
instructions to binary machine code. This process
transcribes mnemonic names to opcodes, assigns
labels to memory addresses, executes directives
(e.g., DB allocates one byte of memory space),
strips out comments after a semicolon, and
calculates addresses (e.g., PC-relative). It is not
necessary to run an assembler on the machine on
which the output machine code will be executed. In
fact, an assembler may be run on any machine, though
most of the time we use the same machine to run the
assembler and to run the machine code generated by
the assembler.
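The steps above can be illustrated with a toy two-pass assembler in Python. Everything about the target machine here is hypothetical: the opcode table, the fixed two-byte instruction format (opcode followed by a one-byte operand), and the absolute addressing are assumptions made for the sketch, and do not correspond to x86 or to any real assembler.

```python
# Hypothetical opcode table: the assembler is essentially a table lookup.
OPCODES = {"LOAD": 0x01, "ADD": 0x02, "STORE": 0x03, "JMP": 0x04, "HALT": 0xFF}


def assemble(source):
    """Translate toy assembly text into machine code bytes."""
    labels, statements, address = {}, [], 0
    # Pass 1: strip comments, assign labels to addresses.
    for raw in source.splitlines():
        text = raw.split(";", 1)[0].strip()      # drop comment after semicolon
        if ":" in text:
            label, text = text.split(":", 1)
            labels[label.strip()] = address
            text = text.strip()
        if not text:
            continue
        fields = text.split()
        statements.append(fields)
        address += 1 if fields[0] == "DB" else 2  # DB allocates one byte
    # Pass 2: replace mnemonics with opcodes and labels with addresses.
    code = []
    for fields in statements:
        operand = fields[1] if len(fields) > 1 else "0"
        value = labels[operand] if operand in labels else int(operand, 0)
        if fields[0] == "DB":
            code.append(value & 0xFF)
        else:
            code.extend([OPCODES[fields[0]], value & 0xFF])
    return bytes(code)
```

Two passes are used because a label may be referenced before it is defined: the first pass only computes addresses, and the second pass emits the bytes once every label is known.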


x86 Assembly Programming 


Most of the processors manufactured by AMD and 
Intel follow the x86 architecture, which is based on 
the Intel 8086 processor. Let’s first examine 
available registers, followed by instructions, and 
addressing modes. 


Registers 


Registers in the x86 family include 16-bit, 32-bit,
and 64-bit versions. There are general purpose
registers and special purpose registers. Table 2
lists the four 16-bit general purpose registers and
their purposes.


Table 2 The General Purpose Registers (16-bit) in
the x86 Architecture


Register   Purpose
AX         Primary accumulator used for I/O and
           arithmetic operations.
BX         Base register used to hold a base address
           for memory access.
CX         Counter register used for loop control.
DX         Data register used for I/O operations.


Each of the 16-bit general purpose registers (AX, BX, 
CX, and DX) may be accessed as two separate bytes. 
For example, the high byte and the low byte of AX 
can be accessed via AH and AL, respectively. Table 
3 lists the 16-bit special purpose registers in the x86 
architecture. 


Table 3 The Special Purpose Registers (16-bit) in the 
x86 Architecture 


Register   Purpose
SP         Stack pointer, which points to the top of
           the stack.
BP         Base pointer, which points to the base
           address of the stack frame.
SI         Source index for string operations.
ES         Extended data segment.
FLAGS      Flag register, which holds operation
           status such as the carry flag and the
           overflow flag.
IP         Instruction pointer (program counter),
           which points to the address of the
           instruction to be executed next.


There are other registers in different x86 processors. 
For example, in the 80286, there are special 
registers to hold descriptor table addresses such as 
global descriptor table register (GDTR), local 
descriptor table register (LDTR), interrupt descriptor 
table register (IDTR), and task register (TR). 


In the 32-bit 80386 processor, most of the 16-bit
registers, except the segment registers, are extended
to 32 bits. A prefix “E” is added to the registers to
differentiate them from their 16-bit versions. For
example, EAX is the 32-bit version of AX, EIP is the
32-bit version of IP, and so on. Other registers
added in the 32-bit x86 processors are listed in
Table 4.


Table 4 Extra Registers in the x86 Architecture


Register             Purpose
FS, GS               Additional segment registers
ST(0), ..., ST(7)    Floating-point registers
MMX0, ..., MMX7      Multimedia (MMX) registers
XMM0, ..., XMM7      Streaming SIMD extension registers


Like the 32-bit processors, the 64-bit x86 processors
(e.g., AMD Opteron and Intel Pentium 4F) extend
the registers from 32 bits to 64 bits. The prefix “R”
is used to indicate the 64-bit version. For example,
the 64-bit registers include RAX, RBX, RCX, RDX,
RSI, RDI, RBP, RSP, RFLAGS, and RIP. There are
also eight additional 64-bit general purpose
registers, R8-R15.


Instructions 


Instructions may be classified into data transfer,
arithmetic, logic, bit shifting, control, etc. Data
transfer instructions are used to move data from one
place to another. The available transfers are
register to register, register to memory, memory to
register, and immediate value to register. Note that
the MOV instruction has no memory to memory form
in the x86 instruction set. For example, we may set
AX to the value 1234h as illustrated in Table 5. The
postfix “h” indicates that the value 1234 is
hexadecimal. The description after the semicolon is
a comment on what the statement does. Comments
let programmers document what each statement is
doing, and they have no effect on program
execution.


Table 5 Set AX to 1234h 


MOV AX, 1234h ; set AX to 1234h 


Arithmetic instructions include addition (ADD),
subtraction (SUB), multiplication (MUL), increment
(INC), and decrement (DEC). Some instructions are
designed to work on specific registers. For
example, MUL implicitly uses the accumulator, i.e.,
one operand has to be stored in AX (or AL for a byte
operation) for MUL to work correctly, and the
product is placed in DX:AX (or AX). Some
instructions, such as ADD, may take an immediate
value. Some, such as MUL, do not. Table 6 lists the
arithmetic instructions in the x86 architecture.


Table 6 Arithmetic Instructions of the x86
Instruction Set


To add two numbers together, say 1 and 2, we first
set AX to 1 and BX to 2. Then we use the ADD
instruction to add BX to AX. The result will be in
AX, as illustrated in Table 7.


Table 7 Add Two Numbers Together in the x86 
Assembly Programming 


MOV AX, 1h ; set AX to 1h
MOV BX, 2h ; set BX to 2h
ADD AX, BX ; add 1h and 2h; the sum is in AX
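As a rough illustration of what the ADD instruction does to a 16-bit register, the following Python sketch models the wrap-around and two of the status flags. The function `add16` and the flag dictionary are assumptions of this example; a real processor also updates other flags, such as overflow and sign.

```python
def add16(a, b):
    """Model a 16-bit ADD: return (result, flags)."""
    total = (a & 0xFFFF) + (b & 0xFFFF)
    result = total & 0xFFFF          # only 16 bits fit in the register
    flags = {
        "CF": total > 0xFFFF,        # carry out of bit 15
        "ZF": result == 0,           # result is zero
    }
    return result, flags


ax, flags = add16(0x1, 0x2)  # mirrors MOV AX, 1h / MOV BX, 2h / ADD AX, BX
```

Adding 1 to FFFFh wraps the result around to 0 and sets both the carry flag and the zero flag.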


The logic instructions include logical and (AND),
logical not (NOT), and exclusive or (XOR). The x86
instruction set also provides OR, although the
logical OR operation may be achieved by the AND and
NOT instructions using De Morgan’s law,
$A \lor B = \lnot(\lnot A \land \lnot B)$. The logic
instructions may also be applied to one byte (half)
of a 16-bit register, such as AL or AH. Table 8 lists
examples of logic operations in the x86 architecture.
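De Morgan's law for building OR out of AND and NOT can be checked on 16-bit values with a small Python sketch. The helper names `not16` and `or_via_and_not` are hypothetical and exist only for this illustration.

```python
MASK16 = 0xFFFF  # keep results within a 16-bit register


def not16(x):
    """Bitwise NOT restricted to 16 bits."""
    return ~x & MASK16


def or_via_and_not(a, b):
    """Compute a OR b using only AND and NOT (De Morgan's law)."""
    return not16(not16(a) & not16(b))


result = or_via_and_not(0x00F0, 0x0F05)
```

Here `result` equals `0x00F0 | 0x0F05`, and the same identity holds for every pair of 16-bit values.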


Table 8 Examples of Logic Operations in the x86
Architecture


XOR CX, 5h ; CX := CX XOR 0101


Control instructions change the flow of program
execution. They are typically used in a decision
structure, like an if-then-else statement, or in a
loop. Since the IP points to the next instruction to
be executed, a control instruction sets the IP to a
new value based on some condition. A typical
condition would be whether the result of the
previous operation is zero or not. If no condition
is involved, the program flow is altered
unconditionally. Prior to a control instruction, a
test instruction such as CMP is normally used to set
the condition. In a loop implementation, INC or DEC
instructions are used to update a loop variable,
which controls the number of iterations. Table 9
lists examples of the control instructions in the
x86 architecture.


Table 9 Examples of Control Instructions in the x86
Architecture


CMP BX, 4h ; compare BX and 4h
JE L123 ; jump to L123 if BX == 4h
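The interplay between a compare instruction and a conditional jump can be modeled as follows. This Python sketch is a simplification under stated assumptions: `cmp16` records only the zero and carry flags, and `je` returns the next value of the IP; the addresses used are made up for the example.

```python
def cmp16(a, b):
    """Model CMP: compute a - b, keep only the flags."""
    diff = (a - b) & 0xFFFF
    return {"ZF": diff == 0, "CF": (a & 0xFFFF) < (b & 0xFFFF)}


def je(flags, target, fall_through):
    """Model JE: pick the next IP depending on the zero flag."""
    return target if flags["ZF"] else fall_through


L123 = 0x0123          # hypothetical address of label L123
NEXT = 0x0010          # hypothetical address of the following instruction
ip_equal = je(cmp16(4, 4), L123, NEXT)
ip_other = je(cmp16(5, 4), L123, NEXT)
```

When BX holds 4 the jump is taken and the IP becomes the address of L123; otherwise execution continues with the next instruction.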


Assembly Program Format 


Programs written directly in mnemonic instructions
are assembly programs. Most assembly programs
follow a general format. Basically, one statement
occupies one line in a text file, and the line is
divided into four columns. The first column is used
for labels. The second column holds the operation
code. The third column specifies the operands, and
the fourth column, after a semicolon, is for
comments. Labels are placeholders for the addresses
corresponding to where they are in the code segment.
Programmers may choose any labels they like as long
as they are legal and do not conflict with names
used by the system, such as instruction mnemonics.
The operation code (op code) is the mnemonic for the
instruction. Normally, mnemonics are not case
sensitive, i.e., you may use upper case or lower
case. However, some assemblers may be designed to be
case sensitive. In that case, lower case and upper
case are different. Between columns, a TAB may be
used to line up each column. It is strongly
recommended that programmers follow this format to
make the code clear and easier to debug.


Examples of the x86 Assembly 


With the control instructions and the arithmetic
instructions, a loop that adds integers from 1 to 10
may be implemented as depicted in Table 10. Note
that the program performs the summation from 10
down to 1, and the register BX is used as the loop
counter (variable). Counting down saves one
instruction: if the summation were performed from 1
up to 10, an extra instruction would be needed to
compare BX with 10. What is worse, that comparison
would have to be executed in every iteration!
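The saving from counting down can be estimated with a Python sketch that counts executed instructions. The instruction sequences assumed here (ADD, DEC, and a conditional jump when counting down; ADD, INC, CMP, and a conditional jump when counting up) are illustrative assumptions and may differ from the actual listing in Table 10.

```python
def sum_counting_down(n=10):
    """Sum n..1; the decrement itself sets the zero flag."""
    ax, bx, executed = 0, n, 0
    while True:
        ax = (ax + bx) & 0xFFFF      # ADD AX, BX
        bx = (bx - 1) & 0xFFFF       # DEC BX
        executed += 3                # ADD, DEC, conditional jump
        if bx == 0:                  # jump falls through at zero
            return ax, executed


def sum_counting_up(n=10):
    """Sum 1..n; an explicit compare is needed in every iteration."""
    ax, bx, executed = 0, 1, 0
    while True:
        ax = (ax + bx) & 0xFFFF      # ADD AX, BX
        bx = (bx + 1) & 0xFFFF       # INC BX
        executed += 4                # ADD, INC, CMP, conditional jump
        if bx == n + 1:              # CMP BX, n+1 finds them equal
            return ax, executed
```

Both functions return the sum 55 for n = 10, but under these assumptions the loop body executes 30 instructions when counting down and 40 when counting up.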


Another example, which uses the 32-bit registers of
the Pentium II, adds two numbers in memory and
stores the sum back to memory. The 32-bit registers
have an “E” prefix. An assembler directive (DW) is
needed to allocate memory for the variables. The DW
directive means “define a word” in memory. Note that
common x86 assemblers treat a word as 16 bits on
every x86 processor, and reserve a 32-bit double
word with the DD directive. Once we define words in
memory, we may give them labels for later
reference. The code is depicted in Table 11. After
this program snippet is executed, the result 3
should be written to the memory location indicated
by X.


Motorola 680x0 Assembly Programming


Released in 1979, the Motorola 68000 is a 16/32-bit
CISC microprocessor, originally designed for high
performance systems such as Alpha Microsystems
computers, Hewlett-Packard’s HP9000, Sun
Microsystems’ Sun-1, Digital Equipment Corporation’s
VAXstation, and Silicon Graphics’ IRIS 1000 and
2000. Later, the processor was also used extensively
in embedded systems and industrial control systems.


Registers 


The Motorola 680x0 has eight 32-bit general-purpose
data registers (D0-D7) and eight address registers
(A0-A7). The last address register, A7, is actually
the stack pointer (SP), and thus SP is equivalent to
A7 in assembly programming. Separating address
registers from data registers is a distinctive
feature of the 68000 processor. One can easily
identify which registers hold data and which
registers hold addresses. In the 68040, which
includes a floating-point unit, there are eight
floating-point data registers (FP0-FP7) used for
storing floating-point data. Comparison, arithmetic,
and logic operations set bit flags in a status
register to be tested by later conditional jumps.
The bit flags include "zero" (Z), "carry" (C),
"overflow" (V), "extend" (X), and "negative" (N).
The "extend" (X) flag may be used for rotation and
shift operations, which leaves the carry flag free
for flow-of-control purposes.


Instructions 


Data movement instructions include MOVE and
FMOVE. The MOVE instruction transfers data from
memory to memory, memory to register, register to
memory, and register to register. So data movement
in the 68000 is quite flexible. Additionally, there
are other specific move instructions such as MOVEQ,
which moves a small immediate value to a data
register (D0-D7). The FMOVE instruction transfers
data to/from the floating-point data registers.


Instructions may operate upon a designated width of
their operands using a suffix. The 680x0 assembly
language adopts the following suffixes: .B for a
byte (8 bits), .W for a word (16 bits), and .L for a
long word (32 bits). Thus, the following statement
will move the low byte of D0 to the low byte of D1.


MOVE.B D0, D1
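The effect of the size suffix on a register-to-register move can be sketched in Python. The table `SIZE_MASK` and the function `move_reg` are hypothetical names for this illustration; the sketch models only the data movement, not the condition codes.

```python
SIZE_MASK = {"B": 0xFF, "W": 0xFFFF, "L": 0xFFFFFFFF}


def move_reg(src, dst, size):
    """Model MOVE.<size> src, dst between two 32-bit data registers.

    Only the low byte, word, or long word of dst is replaced; the
    remaining upper bits of dst are preserved.
    """
    mask = SIZE_MASK[size]
    return (dst & ~mask & 0xFFFFFFFF) | (src & mask)


d0 = 0x11223344
d1 = 0xAABBCCDD
d1_after_byte_move = move_reg(d0, d1, "B")
```

After the byte move only the lowest byte of D1 changes, giving AABBCC44h in this example.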


The length of the 680x0 family instructions is at 
least one word, and at most 11 words. The first 
word specifies the length of the instruction, the 
effective addressing mode, and the operation. The 
rest of the words specify further details of the 
instruction. 


Addressing Modes 


The addressing modes in the 680x0 family are quite 
extensive. 


Effective addressing modes:
Data register direct mode uses one of the data
registers (D0-D7) for the operand. For example, the
following statement moves the data stored in D0 to
D1.


MOVE.W D0, D1


Address register direct mode uses one of the address
registers (A0-A7) to hold an operand. For example,
the following statement moves the data stored in A0
to D1.


MOVE.W A0, D1


Address register indirect mode uses one of the
address registers (A0-A7), which contains the
address of the operand. For example, the following
statement moves the data stored at the address held
in the register A0 to D1.


MOVE.W (A0), D1


Address register indirect with postincrement mode
is similar to the address register indirect mode
except that the address register is incremented
after the instruction is executed. This is very
efficient in array data manipulation. The increment
of the address register depends on the operand width
dictated by the suffix of the instruction, which
could be one, two, or four for byte, word, or long
word, respectively. In the following example, the
address register A0 is incremented by two after the
instruction is executed.


MOVE.W (A0)+, D1


Address register indirect with predecrement mode
decrements the address register before the
instruction is executed. The amount of the decrement
depends on the size of the operand, which is one,
two, or four. In the following example, the address
register A0 is decremented by 2 before the
instruction is executed.


MOVE.W -(A0), D1
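The postincrement and predecrement modes can be contrasted with a small Python model of memory. The byte string `memory`, the helper `read_be`, and the two load functions are assumptions of this sketch; each load returns the value read together with the updated address register.

```python
SIZE_BYTES = {"B": 1, "W": 2, "L": 4}


def read_be(memory, address, size):
    """Read a big-endian operand of the given size from memory."""
    n = SIZE_BYTES[size]
    return int.from_bytes(memory[address:address + n], "big")


def load_postinc(memory, a0, size):
    """Model MOVE.<size> (A0)+, Dn: read first, then increment A0."""
    value = read_be(memory, a0, size)
    return value, a0 + SIZE_BYTES[size]


def load_predec(memory, a0, size):
    """Model MOVE.<size> -(A0), Dn: decrement A0 first, then read."""
    a0 -= SIZE_BYTES[size]
    return read_be(memory, a0, size), a0


# Three 16-bit words: 1, 2, 3
memory = bytes([0x00, 0x01, 0x00, 0x02, 0x00, 0x03])
```

Walking forward with postincrement from address 0 yields the words 1, 2, 3, while predecrement from address 4 yields the word 2 and leaves the register at 2.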


Address register indirect with displacement mode
specifies a 16-bit displacement to be added to the
address stored in an address register. The
displacement is sign-extended to a 32-bit number
before the summation. The following example moves
the data at the memory address (A0 + 2) to the data
register D1.


MOVE.W (2, A0), D1


Program counter indirect with displacement mode 
uses program counter as the address reference 
instead of an address register. The following 
example moves data stored in the memory address 
(PC + 2) to D1. 


MOVE.W (2, PC), D1 


Absolute short addressing mode specifies its operand 
using the extension word (2 bytes) following the 
instruction. The 16-bit word is sign-extended to 32- 
bit before it is used. 


MOVE.W (100).W, D1 
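Both the displacement mode and the absolute short mode rely on sign-extending a 16-bit word to 32 bits. A minimal Python sketch of that step follows; the function names are hypothetical, and addresses are treated as 32-bit unsigned values.

```python
def sign_extend16(word):
    """Extend a 16-bit word to 32 bits, copying bit 15 upward."""
    word &= 0xFFFF
    return word | 0xFFFF0000 if word & 0x8000 else word


def effective_address(a0, displacement16):
    """Model (d16, A0): add the sign-extended displacement to A0."""
    return (a0 + sign_extend16(displacement16)) & 0xFFFFFFFF


ea_forward = effective_address(0x1000, 0x0002)   # A0 + 2
ea_backward = effective_address(0x1000, 0xFFFE)  # A0 - 2
```

A displacement of FFFEh is therefore interpreted as -2, so the effective address falls below the base address.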


Absolute long addressing mode specifies its operand
using the two extension words (4 bytes) following
the instruction. The first 16-bit word is the
high-order part of the address whereas the second
16-bit word is the low-order part of the address.
The two words form a 32-bit effective address for
the operand in memory. The following gives an
example.


MOVE.W (100).L, D1


Immediate addressing mode specifies the operand as 
a value which occupies one or two extension words. 
The following example moves the value 1234 to D1. 


MOVE.W #1234, D1


There are other sophisticated addressing modes 
using an index register which contains size and 
scaling information. For more information, please 
refer to the programmer’s reference manual by 
Motorola. 


Summation in 680x0 


The following example (Table 12) illustrates adding
array elements together in a loop. The assembly
program follows the standard 4-column format. In
this example, data are stored in memory pointed to
by the address register A0. After each iteration,
the value of A0 is increased by 2, so that it points
to the next element of the array in memory. DBF is
an instruction that decrements the counter (D2) and
checks whether it has reached -1. If yes, the loop
terminates. Otherwise, the loop continues and the
next array element is retrieved and stored in D0. It
is then added to D1 using a long word operation. The
long word (4 bytes) accommodates large unsigned
integers up to $2^{32}-1$.


        MOVEQ   #0, D0       ; initialize D0 to 0
        MOVEQ   #0, D1       ; initialize D1 to 0
        MOVEQ   #9, D2       ; initialize D2 to 9
Loop    MOVE.W  (A0)+, D0    ; move data @A0 to D0,
                             ; then increase A0 by 2
        ADD.L   D0, D1       ; D1 := D0 + D1 in long
                             ; word operation
        DBF     D2, Loop     ; D2 := D2 - 1, jump to
                             ; Loop if D2 != -1


Table 12 Sum Array Elements in the 680x0 Assembly
Programming
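The behavior of the DBF-controlled loop can be traced with a Python model. This is a sketch under assumptions: the array is a Python list of 16-bit words rather than real memory, and the function name `sum_array_dbf` is made up for the example. Note that a counter initialized to 9 produces ten iterations, because the loop ends only when the counter reaches -1.

```python
def sum_array_dbf(words, count_minus_one):
    """Model the MOVE.W / ADD.L / DBF loop over a list of 16-bit words."""
    d0 = 0                                  # MOVEQ #0, D0
    d1 = 0                                  # MOVEQ #0, D1
    d2 = count_minus_one & 0xFFFF           # MOVEQ #count-1, D2
    index = 0                               # stands in for A0
    while True:
        # MOVE.W (A0)+, D0 replaces only the low word of D0
        d0 = (d0 & 0xFFFF0000) | (words[index] & 0xFFFF)
        index += 1
        d1 = (d1 + d0) & 0xFFFFFFFF         # ADD.L D0, D1
        d2 = (d2 - 1) & 0xFFFF              # DBF decrements the low word
        if d2 == 0xFFFF:                    # counter reached -1: exit
            return d1


total = sum_array_dbf(list(range(1, 11)), 9)
```

Summing the two words FFFFh and 1 gives 10000h, which no longer fits in a 16-bit word and shows why the accumulation uses a long word.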


Not all instructions in the 680x0 family are equally
fast. For example, the CLR instruction is 2 clocks
slower than the MOVEQ instruction. The example shown
in Table 12 could use CLR.L D0 to clear the register.
It would do the job, but the MOVEQ instruction is
faster. To write a high performance program in the
680x0 assembly language, a programmer has to fully
understand the performance of each instruction and
use it accordingly.


Sparc Assembly Programming 


Sparc stands for Scalable Processor ARChitecture,
and is a RISC architecture with a 32-bit virtual
address space. Sparc is a big-endian machine, meaning
that the high byte is stored at the lower address.
There are some similarities between the Sparc and
x86 architectures: byte-addressable memory, two's
complement for signed integers, floating point that
follows the IEEE standard, arithmetic, logical, and
shift operations, branching and calling
instructions, condition codes used for branch
decisions, and stack frame support (sp, fp/bp) for
procedure calls. Both Sparc and x86 can be pipelined
and have superscalar implementations.
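The byte ordering mentioned above can be made concrete with Python's built-in integer conversion. The helper `store_word` is a hypothetical name; it returns the four bytes of a 32-bit value in the order they would appear at increasing memory addresses.

```python
def store_word(value, byteorder):
    """Return the 4 bytes of a 32-bit value in memory order."""
    return (value & 0xFFFFFFFF).to_bytes(4, byteorder)


value = 0x12345678
big = store_word(value, "big")        # Sparc: high byte at lowest address
little = store_word(value, "little")  # x86: low byte at lowest address
```

On the big-endian machine the byte 12h comes first, whereas on the little-endian x86 the byte 78h comes first.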


There are differences between Sparc and x86. Table
13 lists the differences between the two
architectures.


Table 13 The Differences between the Sparc and x86
Architectures


Registers


There are 32 registers in Sparc: 8 global, 8 local, 8
output, and 8 input. The register numbering is shown
in Table 14.


Table 14 Registers in Sparc


Registers  Symbolic                  Remark
Global     %g0 (%r0), %g1 (%r1),     %g0 is always zero
           ..., %g7 (%r7)
Output     %o0 (%r8), %o1 (%r9),     %r14 = %sp stack
           ..., %o7 (%r15)           pointer; %r15 for
                                     return address
Local      %l0 (%r16), %l1 (%r17),
           ..., %l7 (%r23)
Input      %i0 (%r24), %i1 (%r25),   %r30 = %fp frame
           ..., %i7 (%r31)           pointer; %r31 for
                                     return address


Instructions


There are 69 instructions, all of which have a fixed
length of 32 bits. Only load and store instructions
may access memory.


Central Processing Unit (CPU) 

This chapter describes the central processing unit
(CPU), its types, and its processing power. Topics
include introductory material and performance
(technology trends, measuring CPU performance,
Amdahl’s law, and averaging performance metrics);
instruction sets (components of instruction sets,
understanding instruction sets from an
implementation perspective, RISC and CISC, and
example instruction sets); arithmetic hardware
(ripple carry, carry lookahead, and other adder
designs, ALU and shifter hardware design); and
datapaths (single-cycle and multi-cycle datapaths,
control of datapaths).


Central Processor Unit (CPU) 


The CPU is the brain of a computer system. All the
computations are performed in the CPU. It controls
other peripherals, monitors input/output, reports
errors, and the like. A computer system cannot
function without a CPU.


Introduction 


The CPU is the most important part of a computer
system (and perhaps the most expensive as well). It
is often referred to simply as the central
processor, or just the processor. The CPU performs
most calculations. In early computers, a CPU might
be built on one or more printed circuit boards.
Today, the CPU used in a personal computer (PC) is
built on a single chip called a microprocessor. Note
that PC has another meaning in the computer
architecture discipline, which is program counter.
Since the 1970s, microprocessors have dominated the
CPU market and implementation.


The package of a CPU could be square, rectangular,
or round. CPUs are small, up to several square
inches in size. Intel and AMD are the two major CPU
manufacturers for PCs. Normally, a CPU will have up
to several hundred metallic connectors or pins for
receiving or issuing signals. Most PC motherboards
come with a CPU socket to keep the CPU in place.
Each CPU has a corresponding CPU socket, and the
socket is designed to be foolproof, i.e., it can
only accept the right type of CPU in the right
orientation. Other small CPUs, such as SMDs (surface
mount devices), are soldered onto a circuit board.
Each motherboard supports only a specific type or
range of CPUs. A small fan on a heat sink is
normally installed on top of a CPU to dissipate heat
effectively.


Typically, a CPU is composed of the following:
arithmetic logic unit (ALU), control unit, and
registers. The ALU performs arithmetic and logical
operations, such as addition, subtraction, logical
AND, logical OR, logical NOT, shifting, and so on.
Normally, complex operations like multiplication or
division are implemented in another unit. The
control unit fetches an instruction from memory,
stores it in an instruction register, decodes the
instruction, and sends control signals to other
units within the CPU. For example, the control unit
will send the ADD opcode to the ALU when an add
instruction is in execution.
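The way a control unit selects an ALU operation by opcode can be sketched as a simple dispatch table in Python. The opcode names, the table `ALU_OPS`, and the 8-bit default width are assumptions of this example and do not describe any particular processor.

```python
import operator

# Hypothetical opcode table: the control unit picks one entry.
ALU_OPS = {
    "ADD": operator.add,
    "SUB": operator.sub,
    "AND": operator.and_,
    "OR": operator.or_,
}


def alu(opcode, a, b, width=8):
    """Apply the selected operation and truncate to the ALU width."""
    mask = (1 << width) - 1
    return ALU_OPS[opcode](a, b) & mask


result = alu("ADD", 200, 100)  # 300 does not fit in 8 bits
```

Because the result is truncated to the ALU width, adding 200 and 100 in an 8-bit ALU yields 44, and subtracting 2 from 1 yields 255.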


Von Neumann Machines 


Early computer designs, specifically calculators,
had the “program” hardcoded in the computer. So the
computer could only run one specific program. It
would be really hard to run another program because
the whole hardware had to be reconfigured and
rebuilt. John von Neumann proposed a computing model
that runs a stored program, a set of instructions,
sequentially. This makes running another program
easy: simply replace the stored program with another
program.


RISC vs. CISC 


In general, we may classify CPUs into RISC and CISC
architectures. RISC stands for reduced instruction
set computer whereas CISC stands for complex
instruction set computer. The RISC design is
centered on simplicity. Each of the instructions in
a RISC chip is simple. Only a handful of
instructions are required. A complex operation is
then decomposed into several simple instructions.
Therefore, by and large a RISC processor achieves a
higher clock rate and performs more instructions per
clock cycle than a CISC processor.


On the other hand, a CISC chip has a large number of
different and complex instructions. A complex
instruction requires more hardware resources than a
simple one. Owing to the increasing density of VLSI
chips, chip space is not an issue. With the idea
that hardware is always faster than software, and
with abundant hardware space, packing a powerful
instruction set into a chip is what the CISC camp
advocates. Meanwhile, the more powerful the
instruction set, the fewer instructions are needed
and the shorter programs will be.


Because of the complex instruction design, CISC
chips require a longer time to run an instruction.
However, for a specific task, a CISC program needs
fewer instructions than its RISC counterpart because
each complex instruction may fulfill several
operations. CPU manufacturers such as Intel and AMD
produce CISC processors (x86 architecture), while
Apple (IBM PowerPC) and Sun (Sparc) promote the RISC
architecture. Most programs developed for the x86
CISC architecture (AMD/Intel/VIA) are compatible
with one another. However, programs for different
RISC architectures are not compatible.


There has always been a dispute between the RISC and
CISC camps as to which architecture is better. The
RISC camp argues that the design is simple, the
instructions are faster, and the chip is cheaper.
However, by making the hardware simpler, the
software has to be more sophisticated to achieve
performance, and thus becomes very complex. System
software developers need to generate more lines of
code for the same task than on a CISC architecture.
As Moore’s law indicates that chip density doubles
every eighteen months, the CISC camp claims that
CISC chips are becoming faster and cheaper anyway.


Chip design is moving toward a hybrid architecture.
Many recent CPUs combine RISC and CISC features. For
example, traditional RISC architectures provide data
movement instructions between memory and registers
only, but the MSP430, a RISC-like architecture, has
instructions that move data from one location to
another in memory. In terms of the number of
instructions, CISC generally has more instructions
than RISC. However, the PowerPC 601 (RISC) supports
more instructions than the Pentium (CISC). Moreover,
some techniques used in RISC, such as pipelining,
are applied in many CISC CPUs. Therefore, the hybrid
architecture may be prevalent in the future.


I/O 


There are several ways of data input and output
(I/O) to/from the CPU and other devices.


Buses


Component Selection


Intel Microprocessor Development


A brief history of microprocessor development at
Intel is given in the following sections. The
sections are organized by the number of bits of data
processed by a processor at a time. An n-bit
processor is one that processes n-bit data in an
instruction. Of course, if the operation is binary,
both operands are n bits wide.


4-Bit Processor 


The early CPUs Intel designed were for calculators,
such as the 4004 used in the Busicom calculator. The
Intel 4004 is the first single-chip microprocessor
(released in 1971), running at 740 kHz with a 4-bit
bus width, 640 bytes of addressable data memory, and
4 KB of program memory. The program memory and
data memory are separate. The instruction length is
8 bits (one byte), and the data word is 4 bits. There
are 46 instructions (41 are 8 bits wide, and 5 are 16
bits wide), and 16 registers of 4 bits each. All
communication, including data and addresses, between
the CPU and RAM/ROM is via a 4-bit data bus.
Therefore, it requires several clock cycles to fetch
and execute an instruction. As an example, the ADD
instruction, which adds a register to the
accumulator, requires three clock cycles to send the
PC (a 12-bit address, 4 bits at a time), two cycles
to fetch the 8-bit instruction (4 bits at a time),
one cycle for the instruction register, one cycle
for the temp operand register, and one cycle to
write back to a register. So in total the Intel 4004
requires 8 clock cycles to execute an 8-bit
instruction. A 16-bit instruction requires 16 clock
cycles. Since the Intel 4004 was designed for
calculators, recursion is not allowed, and procedure
calls are limited to three levels deep, i.e., the
longest procedure call chain is four. The stack that
keeps each procedure context is hardwired in the
CPU. Since it processes 4-bit data at a time (e.g.,
adding two 4-bit data together), the Intel 4004 is
called a 4-bit processor.
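The cycle count above can be turned into an approximate instruction time with a few lines of Python. The stage names in `CYCLES` paraphrase the text, and the helper `instruction_time_us` is a hypothetical name; the calculation simply divides the cycle count by the 740 kHz clock rate.

```python
CLOCK_HZ = 740_000  # 740 kHz clock of the Intel 4004

# Clock cycles per stage of an 8-bit instruction, as listed in the text.
CYCLES = {
    "send 12-bit address": 3,
    "fetch 8-bit instruction": 2,
    "instruction register": 1,
    "temp operand register": 1,
    "write back": 1,
}


def instruction_time_us(cycles, clock_hz=CLOCK_HZ):
    """Execution time in microseconds for a given number of cycles."""
    return cycles / clock_hz * 1e6


total_cycles = sum(CYCLES.values())
time_8bit = instruction_time_us(total_cycles)
time_16bit = instruction_time_us(2 * total_cycles)
```

Under these figures an 8-bit instruction takes roughly 10.8 microseconds, and a 16-bit instruction takes twice as long.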


Intel 4040 (released in 1974), the successor of the 
Intel 4004, adds interrupt and single step features 
with extensions: 60 instructions, 8 KB program 
memory, 24 registers, 7-level deep procedure calls. 


8-Bit Processor 


Introduced in 1972 and developed in tandem with the
4004, the Intel 8008 is the first 8-bit processor,
running at 500 kHz. Its data and addresses share an
8-bit wide bus. The total addressable program memory
is 16 KB. The Intel 8008 was originally designed for
the Datapoint 2200 programmable terminal (with 2
cassette tape drives, each with 130 KB capacity) of
the Computer Terminal Corporation (CTC), located in
San Antonio, TX, and later renamed Datapoint for the
high volume sales of its Datapoint 2201. However,
the original Datapoint 2200 was not equipped with
the 8008 because of its delayed delivery and its
failure to meet CTC’s performance goal. Instead of
a microprocessor, CTC’s TTL design (about 100
SSI/MSI chips) was used in the Datapoint 2200.
Nevertheless, the seminal importance of the 8008
design is that it is the forerunner of the
x86-family CPUs.


Two years after the release of the 8008, Intel
introduced the 8080 in 1974. The 8080 greatly
increases the clock rate to 2 MHz, about 10 times
faster than the 8008. The data bus is 8 bits wide
and there are 16 address lines. So the total
addressable memory is $2^{16}$ B, i.e., 64 KB. The
8080 is the first CPU that is designed to have an
address bus separate from the data bus. The assembly
language is downward compatible with the 8008, i.e.,
a program written for the 8008 may also run on the
8080. It is used in traffic light controllers. The
8080 also introduces a 16-bit stack pointer, which
replaces the internal stack of the 8008. It is
interesting that the 8008 requires two transactions
to transfer a 14-bit address to a memory address
register (MAR) via its 8-bit data/address bus.
Obviously, 2 bits are wasted in the second address
transaction. The addressing ability of the 8008 is
limited by its 14-bit program counter, and thus it
can address 16 KB of memory.
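The relationship between address width and addressable memory used above is a power of two, which a short Python sketch makes explicit. The helper names `addressable_bytes` and `format_size` are assumptions of this example.

```python
def addressable_bytes(address_bits):
    """Number of distinct byte addresses for a given address width."""
    return 1 << address_bits


def format_size(n):
    """Render a power-of-two byte count as B, KB, MB, or GB."""
    for unit in ("B", "KB", "MB", "GB"):
        if n < 1024 or unit == "GB":
            return f"{n} {unit}"
        n //= 1024


size_8008 = format_size(addressable_bytes(14))  # 14-bit program counter
size_8080 = format_size(addressable_bytes(16))  # 16 address lines
```

The 14-bit program counter of the 8008 thus reaches 16 KB, while the 16 address lines of the 8080 reach 64 KB.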


Later 8-bit CPU development focused on power supply,
direct memory access (DMA), etc. The digit “5” in
8085 (released in 1976) indicates that the CPU
requires 5V only, unlike its predecessors, which
require both 5V and 12V. The 8085 also introduced
maskable and non-maskable interrupts, and DMA. There
are seven 8-bit registers in the 8080/8085, named A,
B, C, D, E, H, and L, where A is the accumulator, and
the rest may work as independent byte registers or
as three 16-bit register pairs, named BC, DE, and
HL. Though the 8080/8085 is an 8-bit processor, it
has some 16-bit operations. For example, the DAD
instruction may add any of the 16-bit register pairs
to the HL register pair.
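How two 8-bit registers act as one 16-bit pair, and how DAD adds a pair to HL, can be modeled in Python. The register dictionary and the helper names `pair`, `unpair`, and `dad` are assumptions of this sketch; the return value of `dad` models the carry out of the 16-bit addition.

```python
def pair(high, low):
    """Combine two 8-bit registers into one 16-bit value."""
    return ((high & 0xFF) << 8) | (low & 0xFF)


def unpair(value):
    """Split a 16-bit value back into (high, low) bytes."""
    return (value >> 8) & 0xFF, value & 0xFF


def dad(regs, pair_name):
    """Model DAD: HL := HL + register pair; return the carry."""
    hl = pair(regs["H"], regs["L"])
    rp = pair(regs[pair_name[0]], regs[pair_name[1]])
    total = hl + rp
    regs["H"], regs["L"] = unpair(total & 0xFFFF)
    return total > 0xFFFF


regs = {"B": 0x12, "C": 0x34, "D": 0, "E": 0, "H": 0x00, "L": 0xFF}
carry = dad(regs, "BC")  # HL = 00FFh + 1234h
```

After the addition HL holds 1333h, spread across H = 13h and L = 33h, and no carry is produced.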

The Harvard architecture, which originated from the
Harvard Mark I, an electromechanical computer, is a
computer architecture in which data storage is
separate from program memory.


8-Bit Microcontroller 


During the development of 8-bit processors, Intel
integrated the CPU with ROM, RAM, timers, interrupts,
I/O ports, etc., into a single component, the
microcontroller. A design using the 8080 processor
may need tens of other chips such as RAM, ROM, and
supporting TTL ICs. This integration greatly
simplifies application design and shortens time to
market. The first microcontrollers (8048, 8035,
8748), called the MCS-48 series, were released in
1976. The 8749 comes with 2K EPROM (erasable
programmable ROM). The MCS-48 has a design similar
to the Harvard architecture, with internal (up to
4K) or external ROM, and 64-256 bytes of on-chip
RAM. Due to its low cost and full-fledged
development tools, the MCS-48 was used extensively
in consumer electronics such as PC keyboards, TV
remotes, toys, etc. The MCS-48 was later replaced by
the MCS-51 (higher capacity and more features).


The MCS-51 was released by Intel in 1980. Owing to
its simplicity, the MCS-51 was quite popular for
embedded systems. Since then, it has been used for
introductory microcontroller courses in engineering
schools. Though Intel has discontinued the MCS-51
product lines, other vendors are still producing and
enhancing MCS-51 products. Different vendors may
manufacture slightly different MCS-51 parts, but the
basic features should be included. Note that the “C”
in the product code indicates CMOS, which consumes
less power than the NMOS versions.


Bit-Slice Processor 


In late 1974, Intel introduced bit-slice processors
(the 3000 family). Bit-slicing is a technique to
build a processor using components of smaller width.
For example, a 4-bit processor may be built by
cascading four one-bit processors. One of the
advantages of bit-slicing is that a complex
processor can be economically built from
off-the-shelf components, and bipolar transistors
may be used in the smaller components, which makes
the processor much faster than one using NMOS or
CMOS transistors. The 3000 family includes a control
unit, a 2-bit ALU, a look-ahead carry generator,
etc.
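The idea of cascading narrow slices into a wider adder can be sketched in Python. This is a simplified model under assumptions: each slice is a 2-bit adder and the carry simply ripples from one slice to the next, whereas the 3000 family also offers a look-ahead carry generator that is not modeled here. The function names are hypothetical.

```python
def slice_add(a, b, carry_in, width=2):
    """One ALU slice: add two narrow operands plus a carry-in."""
    total = a + b + carry_in
    mask = (1 << width) - 1
    return total & mask, total >> width   # (sum bits, carry-out)


def cascaded_add(a, b, slices, width=2):
    """Build a wider adder by chaining slices through their carries."""
    mask = (1 << width) - 1
    result, carry = 0, 0
    for i in range(slices):
        shift = i * width
        part, carry = slice_add((a >> shift) & mask,
                                (b >> shift) & mask, carry, width)
        result |= part << shift
    return result, carry


sum4, carry4 = cascaded_add(0b1011, 0b0110, slices=2)  # 4-bit adder
```

Two 2-bit slices form a 4-bit adder, so 11 + 6 gives the 4-bit result 1 with a carry out, and four slices form an 8-bit adder in the same way.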


16-Bit Processor 


Six years after the introduction of its first 8-bit
processor, Intel released the 16-bit x86 processor
family in 1978. The clock rate is increased up to 10
MHz. The memory is organized in odd and even banks
so as to read 16-bit data in one clock cycle. The
first x86 processor released is the 8086. Its data
bus is 16 bits wide, and its address bus is 20 bits
wide. Therefore, the addressable memory is $2^{20}$
B, i.e., one megabyte. The IBM PC XT is based on the
8088, where the data bus is 8 bits wide instead of
16. The IBM PC AT is based on the 80286, which runs
at up to 25 MHz. The 80286 has a 16-bit data bus and
a 24-bit address bus. Therefore, its addressable
memory is 16 MB.


The 80286 was released in 1982. The 80186 was also
released that year, but its fixed address design for
the DMA controller, timers, and interrupts is
different from that of the IBM PC. Thus, there is no
IBM PC equipped with the 80186. It is worth
mentioning that the 80286 added memory protection
hardware to support multitasking operating systems
with per-process address spaces.


32-Bit Processor 


It is seldom known that Intel’s first 32-bit
processor is the iAPX 432, released in 1981, not the
well-known 80386DX, released in 1985. The iAPX 432
was built to support multitasking, object-oriented
programming, and memory management such as garbage
collection in hardware. Unfortunately, the complex
design led to clock rates much lower than the
targeted 10 MHz; only up to 8 MHz was achieved. Its
bit-aligned variable length instructions (complex
decoding), fault tolerant bus interface unit (40% of
bus time wasted in wait states), and lack of caches
(memory latency) result in low performance. The
performance is about one quarter of the speed of the
80286 running at the same clock rate. However, the
expensive hardware features have to be exploited by
software to achieve high performance. An Ada
compiler implemented without taking advantage of
the underlying hardware features turned out to be
the major problem. For example, the Ada compiler
uses expensive inter-module procedure calls for
every procedure, instead of an obvious branch and
link instruction; it also uses an enter_environment
call to set up memory protection for every variable,
though most of the time a program is running inside
an existing environment and need not be checked;
what is even worse is that call by value is always
used in passing parameters, resulting in a huge
amount of duplicated memory content.


The iAPX 432 supports object-oriented memory and
garbage collection. The memory design adopts
segmented memory with up to 2^24 segments of up to
64 KB each. The total virtual memory space is 2^40
bytes and the total physical memory is 2^24 bytes.
Programs use a segment address and an offset within
that segment to access an object. Segments are
referenced by an access descriptor, which contains
an index into a system object table and a set of
access rights. Segments may contain either access
descriptors or object data. Therefore, an object
access involves 1) reading the access descriptor and
checking the access rights, 2) reading the system
object table to find the object address, and 3)
reading the data. To improve object access
performance, a later version, iAPX 432 release 3,
combines one access segment and one data segment
into a 128 KB segment, which eliminates one memory
access and doubles the virtual memory space. The
system object table that keeps object addresses may
be used for mark-sweep garbage collection. Unlike in
C programming, objects created in the iAPX 432 do
not need to be deallocated explicitly; when they are
no longer needed, they simply become garbage. There
is no explicit instruction for freeing an object in
the system. Part of Dijkstra’s parallel mark-sweep
garbage collection algorithm is implemented in
microcode using the system object table. Each object
is marked as black, white, or grey as needed. The
operating system implements the remaining part to
complete the garbage collection function.
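The black, white, and grey marking can be illustrated with a small sketch. This is not the iAPX 432 microcode and it is not parallel; it is a minimal sequential tri-color mark-sweep in Python over a hypothetical object table. The names `mark_sweep`, `refs`, and `roots` are assumptions made for illustration only.

```python
def mark_sweep(refs, roots):
    """Return the set of object ids that are garbage.

    refs  : dict mapping an object id to the ids it references
            (a stand-in for the system object table)
    roots : ids that the program can reach directly
    """
    WHITE, GREY, BLACK = "white", "grey", "black"
    color = {obj: WHITE for obj in refs}    # everything starts unvisited
    worklist = []
    for r in roots:                         # roots are reachable: shade grey
        color[r] = GREY
        worklist.append(r)
    while worklist:                         # mark phase
        obj = worklist.pop()
        for child in refs[obj]:
            if color[child] == WHITE:       # first visit: shade grey
                color[child] = GREY
                worklist.append(child)
        color[obj] = BLACK                  # all children have been shaded
    # sweep phase: objects that are still white were never reached
    return {obj for obj, c in color.items() if c == WHITE}


table = {"a": ["b"], "b": [], "c": ["d"], "d": ["c"]}
garbage = mark_sweep(table, roots=["a"])
print(sorted(garbage))   # "c" and "d" reference each other but are unreachable
```

Grey objects are those known to be reachable whose references have not been scanned yet; once the worklist is empty, no grey object remains and every white object can be reclaimed.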


32-Bit x86 Processor 


The first 32-bit processor in the x86 family is the
80386DX, released in 1985. It supports clock rates
up to 33 MHz. Both the data bus and the address bus
are 32 bits wide, which supports 4 GB of addressable
memory and 64 TB of virtual memory. Its memory
protection hardware provides paged virtual memory
and virtual-86 mode. These memory features are
fundamental requirements of modern operating
systems such as OS/2, Linux, Vista, and Mac OS.
The 80386 processor is mainly used in desktop
computing. A lower cost version of this processor is
the 80386SX, released in 1988. The 80386SX uses a
16-bit data bus and a 24-bit address bus, with 16 MB
of addressable memory and 32 GB of virtual memory.
Its internal architecture is 32-bit, like that of
the DX version. However, there is no Math
co-processor. The SX version is targeted at mobile
computing and entry-level PCs. The 80386SL, released
in 1990, has a larger addressing capability than the
SX version. Its addressable memory is 4 GB, and its
virtual memory size is 1 TB. The 80386EX, released
in 1994, is similar to the 80386SX but with many
on-chip peripherals such as timers, power
management, I/O, DMA, JTAG test logic, etc.


The 80486DX, released in 1989, integrates the Math
co-processor and an 8 KB level-one cache. By
contrast, most 80386 motherboards have an IC socket
for the 80387, the external Math co-processor. What
distinguishes the 80486 from the 80386 is its
on-chip level-one cache and Math co-processor. Its
addressable memory space is 4 GB, and its virtual
memory size is 1 TB. It also pushes clock rates up
to 50 MHz. In 1991, Intel released the 80486SX,
which is identical to the DX version but without the
Math co-processor. Note that the 80487SX, the Math
co-processor for the 80486SX, is the same as an
80486DX with a different pin configuration to
prevent users from installing an 80486DX instead of
an 80487SX. Released in 1992, the 80486DX2 raised
clock rates up to 100 MHz, about twice the speed of
the first DX version. The 80486SL, released in 1992,
is used for laptop computers. The 80486DX4, released
in 1994, with 4 GB of addressable memory and 64 TB
of virtual memory, is used in high performance
desktop and laptop computers.


Pentium processors, released in 1993, are 32-bit
processors with a 64-bit data bus and a 32-bit
address bus. The addressable memory is 4 GB, and the
virtual memory size is 64 TB. The Pentium has a
superscalar architecture, runs on 5 volts, and is
used in desktop computing. It has a larger level-one
cache of 16 KB. The clock rates of the Pentium
processors range from 60 MHz to 300 MHz. Depending
on the process technology, the original Pentium
processor has several versions: P5 (0.8 µm) in 1993,
P54 (0.6 µm) in 1994, P54CQS (0.35 µm) in 1995,
P54CS (0.35 µm) in 1995, and P55C (0.35 µm) in 1997.
The P55C is a Pentium processor with the MMX
(multimedia extension) instruction set and an L1
cache of 32 KB.


The P6 family starts with the Pentium Pro processor,
released in 1995, which leads to the Pentium II and
Pentium III. The Pentium Pro is the first Intel
processor with two levels of cache: a 16 KB L1 cache
and a 256/512 KB L2 cache, running up to 200 MHz.
The Pentium II, released in 1997, adds MMX to the
Pentium Pro, with a 32 KB L1 cache and a 512 KB L2
cache. Starting from the Pentium II, the processor
package style is changed to a single edge contact
cartridge (SECC). With the 0.13 µm process
technology, the Pentium II-based processor reaches a
500 MHz clock rate. The Pentium II Celeron, released
in 1998, comes in two versions: Covington has a
32 KB L1 cache but no L2 cache and runs up to
300 MHz, and Mendocino has a 32 KB L1 cache and a
128 KB integrated cache and runs up to 500 MHz. The
Pentium II Xeon, released in 1998, comes with up to
2 MB of L2 cache and runs up to 450 MHz.


Based on the Pentium II architecture, Intel added
streaming SIMD extensions (SSE) to Pentium III
processors. The Pentium III also improves L2
transfer, and pushes clock rates
to 1.4 GHz. Introduced in 1999, Pentium III Xeon 
has L2 cache of size up to 2 MB, and a 64-bit system 
bus. It is worth noting that there are Pentium III 
versions of Celeron (released in 2000), which adopt 
socket 370 instead of SECC. They are mainly for 
mobile computing. Other Intel 32-bit processors for 
mobile computing include Pentium M (2003), and 
Celeron M (2003). In 2006, Intel started producing 
dual core processors such as Core Duo and Dual 
Core Xeon LV, both of which are equipped with 
improved SSE3 SIMD instructions. 


There is another branch of P6 processors called P68,
the NetBurst microarchitecture, which includes hyper
pipelined technology and a rapid execution engine.
The hyper pipeline is a deeper pipeline (20 or 31
stages versus 10 stages in the Pentium III). A
deeper pipeline does less work per stage, which
allows a higher clock rate and thus more
instructions completed per second. However, the
deeper pipeline also means a higher penalty on each
branch mis-prediction. Intel addresses this with
improved branch prediction, claiming a 33% reduction
in mis-predictions. Processors in this category
include Pentium 4 (2000), Itanium (2001), Xeon
(2001), Itanium 2 (2002), Mobile Pentium 4-M,
Pentium 4 EE (2003), Pentium 4 E (2004), and
Pentium 4 F (2004).


64-Bit Processor 


There are two instruction set architectures for
Intel’s 64-bit processors: IA-64 and Intel 64. IA-64
is a new instruction set, totally different from
x86. It is a 64-bit parallel architecture,
implementing branch predication, speculation, and
prediction. Branch predication is a technique to
reduce the mis-prediction cost by allowing each
instruction to be conditionally executed or turned
into a nop (no operation). There are 128-bit
registers to hold instructions, and each 128-bit
register (one instruction word) contains 3
instructions. Each fetch may read 2 instruction
words from the L1 cache, resulting in 6 instructions
in execution per cycle. Intel 64 is also known as
AMD64 or x86-64. It is an extension of the x86
instruction set. This architecture has been
implemented by AMD, Intel, and Via. IA-64 and
Intel 64 are not compatible.


Processors in IA-64 include Itanium (2001) and
Itanium II (2002). Processors in Intel 64 include
Pentium 4F (2005), Pentium D (2005), Pentium EE
(2005), Xeon (2004), Core 2 (2006), Pentium Dual
Core (2007), Celeron, Celeron M, Core i3, Core i5,
and Core i7.


Case Study 80C51 


The 80C51 is a mixed-signal MCU with part number
C8051F005. It has 2304 bytes (256 B + 2 KB) of data
RAM, 32 KB of Flash, 4 bytes of port I/O (P0 to P3),
four 16-bit timers, a programmable oscillator clock
(2 MHz (default) to 16 MHz), two 12-bit DACs, one
12-bit ADC, and 21 vectored interrupt sources. The
following figure shows the components in the 80C51
chip.


1. Memory Map


Perhaps the most important thing to know before
programming, in addition to the instruction set, is
the memory map. The 80C51 has a standard program
and data address configuration. It includes 256
bytes of data RAM, with the upper 128 bytes
dual-mapped. Indirect addressing accesses the upper
128 bytes of general purpose RAM, and direct
addressing accesses the 128-byte special function
register (SFR) address space. The lower 128 bytes of
RAM are accessible via both direct and indirect
addressing. The first 32 bytes are addressable as
four banks of general-purpose registers, and the
next 16 bytes can be byte addressable or bit
addressable. The following figure shows the memory
map. It is worth mentioning that there are 3 memory
spaces (one for program, one for internal data, and
the other for external data). They all start from
0x0. This is a drastically different design compared
to MSP430!


mov A, direct ; direct addressing, move direct to A
mov direct, A ; direct addressing, move A to direct
mov A, @R ; indirect addressing, move indirect to A
mov @R, A ; indirect addressing, move A to indirect
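The dual mapping of the upper 128 bytes can be modeled in a few lines. The following Python sketch is a toy model, not a simulator of the actual chip; the class name `Internal8051` and its methods are hypothetical. It only shows that the same address above 0x7F reaches different cells depending on the addressing mode.

```python
class Internal8051:
    """Toy model of the internal data space of an 8051-style core."""

    def __init__(self):
        self.ram = [0] * 256    # 0x00-0x7F lower RAM, 0x80-0xFF upper RAM
        self.sfr = [0] * 128    # SFRs, seen at direct addresses 0x80-0xFF

    def write_direct(self, addr, value):
        # Direct addressing: 0x80-0xFF selects the SFR space.
        if addr >= 0x80:
            self.sfr[addr - 0x80] = value & 0xFF
        else:
            self.ram[addr] = value & 0xFF

    def write_indirect(self, addr, value):
        # Indirect addressing (through R0 or R1) always selects RAM.
        self.ram[addr] = value & 0xFF

    def read_direct(self, addr):
        return self.sfr[addr - 0x80] if addr >= 0x80 else self.ram[addr]

    def read_indirect(self, addr):
        return self.ram[addr]


mem = Internal8051()
mem.write_indirect(0x90, 0xAA)   # upper RAM byte at 0x90
mem.write_direct(0x90, 0x55)     # SFR at 0x90, a different cell
print(hex(mem.read_indirect(0x90)), hex(mem.read_direct(0x90)))
```

Below 0x80 both modes reach the same byte, which is why the lower 128 bytes behave like ordinary RAM.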


Based on the above instructions with addressing 
modes, add two statements to the tutorial program: 
one to set the value 0x12 at 0xF0 in the RAM; the
other to set the value 0x34 at 0xF0 for the SFR B.
Show your work to the instructor. 


Only one of the four banks of general purpose
registers (0x00 to 0x1F) is active at a time. So
there are 8 registers (R0 to R7) you may manipulate
at a time. Two bits in the program status word
(PSW), which is like SR in MSP430, determine which
bank is active. This allows fast context switching
when entering subroutines and interrupt service
routines. Indirect addressing modes use registers R0
and R1 as index registers.
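As a rough illustration of bank selection, the sketch below computes the internal RAM address behind a register name. It assumes the standard 8051 PSW layout, in which RS0 is PSW bit 3 and RS1 is PSW bit 4; the function name `register_address` is made up for this example.

```python
def register_address(psw, n):
    """Internal RAM address of register Rn for the bank selected in PSW."""
    if not 0 <= n <= 7:
        raise ValueError("register index must be 0..7")
    bank = (psw >> 3) & 0b11    # assumed layout: RS1 = bit 4, RS0 = bit 3
    return bank * 8 + n         # each bank holds 8 one-byte registers


print(hex(register_address(0x00, 0)))   # bank 0, R0
print(hex(register_address(0x18, 7)))   # bank 3, R7
```

Switching banks changes only two PSW bits, so an interrupt service routine can get a fresh set of R0 to R7 without saving the old ones.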


Bit addressable memory from 0x20 to 0x2F may be
accessed as 128 individually addressable bits. Each
bit has a bit address from 0x00 to 0x7F. Bit 0 of
the byte at 0x20 has bit address 0x00 while bit 7 of
the byte at 0x20 has bit address 0x07. Bit 7 of the
byte at 0x2F has bit address 0x7F. The instruction
MOV C, 22h.3 will move bit 3 of the byte at 0x22 to
the carry. The instructions setb and clr are used to
set or clear a bit. This notation makes bit
manipulation easier. Add two statements to the
tutorial program to 1) set the value 0x21 at 0x23 in
the RAM, and 2) set the first bit (bit 1) at 0x23 in
the RAM. Show your work to the instructor.
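The mapping between byte locations and bit addresses described above is plain arithmetic. A minimal Python sketch follows; the function names `bit_address` and `bit_location` are assumptions for illustration.

```python
def bit_address(byte_addr, bit):
    """Bit address of `bit` (0-7) in the byte at 0x20-0x2F."""
    if not 0x20 <= byte_addr <= 0x2F or not 0 <= bit <= 7:
        raise ValueError("outside the bit-addressable area")
    return (byte_addr - 0x20) * 8 + bit


def bit_location(bit_addr):
    """Inverse mapping: (byte address, bit number) for 0x00-0x7F."""
    if not 0x00 <= bit_addr <= 0x7F:
        raise ValueError("bit address must be 0x00..0x7F")
    return 0x20 + bit_addr // 8, bit_addr % 8


print(hex(bit_address(0x22, 3)))   # the bit written as 22h.3
print(bit_location(0x7F))          # last bit of the area
```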


Program Status Word (PSW) 


One of the SFRs is PSW, which is at 0xD0. It keeps
execution status.

RS1 and RS0 are used to select one of the four
register banks.

In 80C51, arithmetic operations are done in a
special register called the accumulator (ACC), which
is one of the SFRs and located at 0xE0. All the SFRs
are one byte wide, including ACC. Thus, 80C51 is an
8-bit CPU because it computes one byte of data per
instruction.


Stack Pointer, and Other SFRs 


The stack pointer (SP) in 80C51 is also one byte
wide, located at 0x81 in the SFR area. After reset,
it is initialized to 0x07. A data pointer (DPH +
DPL), located at 0x83 and 0x82, is used to access
indirectly addressed RAM and Flash memory. One byte
can only address up to 255. Any address larger than
that has to be 2 bytes in 8051, which gives a 64 KB
address space. Since the external RAM and program
Flash are much larger than the internal RAM (256
bytes only), 8051 provides the data pointer to
access these two memory modules. The following
instructions are related to manipulation of the
data pointer.


MOV DPTR,#data16 ;Load data pointer with 16-bit constant
MOVC A,@A+DPTR ;Move code byte relative DPTR to A
MOVC A,@A+PC ;Move code byte relative PC to A
MOVX A,@Ri ;Move external data (8-bit address) to A
MOVX @Ri,A ;Move A to external data (8-bit address)
MOVX A,@DPTR ;Move external data (16-bit address) to A
MOVX @DPTR,A ;Move A to external data (16-bit address)




Note that the code memory is read-only, so you may
not write data to it. Add statements to the tutorial
program to read the code memory at location
0x0006, and put the result in the accumulator.
Show your work to the instructor.




Additionally, a B register located at 0xF0 serves as
a second accumulator for certain arithmetic
operations such as MUL and DIV. Wait! Does
MSP430 provide multiplication and division
instructions? Nope! Other SFRs include controls for
the ADC, clock, comparator, DAC, interrupts, memory,
I/O, timers, bus, etc.


Program Counter


The PC in 8051 is not part of the SFRs. It is a
standalone unit in charge of supplying the address
of the next instruction to be executed.


Based on what you have learned so far and your
observations, answer the following questions.


What is the initial value of SP? 
What is the starting address of the program? 


Does the stack grow downward or upward? 


What is the maximal size of the stack in your 
program? 


Instruction Set Architecture (ISA) 

This chapter introduces instruction sets including 
components of an instruction set, and understanding 
instruction sets from an implementation perspective. 
Basic organization of the von Neumann machine. 
Control unit; instruction fetch, decode, and 
execution. Instruction sets and types. Assembly/ 
machine language programming. Instruction 
formats. Addressing modes. 


Instruction Set Architecture 


Introduction 


An instruction set architecture (ISA) defines a set of 
native instructions to be executed directly by 
hardware. It specifies native data types, instructions, 
registers, addressing modes, memory architecture, 
interrupts, and external I/O. An ISA may be 
implemented in different microarchitectures, e.g., 
Intel Pentium and AMD Athlon implement the x86 
instruction set, but their microarchitectures may be 
essentially different. A native instruction is
executed directly by a CPU and is composed of an
operator (opcode) and operands. A collection of
instructions that fulfills some function is called
machine code. Based on the design strategy, there
are basically two models: complex instruction set
computer (CISC) and reduced instruction set computer
(RISC). In CISC, the length of the instruction is
variable and thus the instruction encoding is quite
complex, whereas in RISC, the length of the
instruction is fixed, and therefore the instruction
encoding is simple. Generally, there are more
instructions in CISC than in RISC. Companies making
CISC processors include Intel and AMD, whereas those
making RISC processors include IBM, Apple, and
Sun Microsystems.

The term instruction will be used in lieu of native
instruction throughout the book when the “native”
concept is not critical in the context.


Native Instruction 


Native instructions are executed by a CPU directly.
A native instruction contains information about
what to do (opcode), and what data to process
(operands). There is only one opcode, and there may
be more than one operand in an instruction. The
following table illustrates the MSP430 instruction
format.


Table MSP430 Instruction Format


The first 4 bits (from bit 12 to bit 15) designate an
opcode. Therefore, up to 16 different opcode values
can be defined in MSP430. Three types of instructions
are defined based on the value of the opcode: single
operand arithmetic (0001), conditional jump (0010,
0011), and two operand arithmetic (0100 to 1111).
Not all fields are used in every instruction. We
classify instructions into Type I for two operand
arithmetic, Type II for single operand arithmetic,
and conditional jumps.


There are 16 registers in MSP430, so they can be
indexed by the 4-bit source (Src) and 4-bit
destination (Dst) fields. The operands are specified
by these two fields. When an instruction is
executed, the CPU will look for the operands based
on the values set in the Src and Dst fields.


Other fields such as Ad, B/W, and As are modifiers of
an instruction. The one-bit Ad field designates the
addressing mode for the destination operand. Because
it is one bit, there are two possible destination
addressing modes. The one-bit B/W field tells whether
the instruction operates upon a one-byte operand or
a one-word (two bytes in MSP430) operand. Thus, each
instruction in MSP430 typically has two versions.
The two-bit As field defines the addressing mode for
the source operand, so there are 4 possible
addressing modes for the source.


Type I Instructions 


The two operand arithmetic instructions use all the
fields of the instruction format. There are 12
instructions defined in Type I, as tabulated below.
It seems odd that there are two required operands
for an instruction in this category but only one
source field is reserved. In fact, both the source
and destination fields are used to specify the
operands, and the result is placed back in the
destination. For example, the instruction “ADD src,
dst” is executed like “dst += src” in C language,
i.e., the sum of src and dst is stored back to dst.


Table Type I MSP430 Instructions 


Opcode  Mnemonic  Two-operand arithmetic
0100    MOV       Move source to destination
0101    ADD       Add source to destination


Type II Instructions


Type II instructions are single operand arithmetic.
Since only one operand is needed, the source field
and the Ad field are used to extend the opcode (bit
7 to bit 15). The original opcode (bit 12 to bit 15)
is always “0001.” There are 7 instructions defined
in this category, as listed in the table below. The
first 6 bits are an ID code, which is always set to
“000100” for Type II instructions. The RRC
instruction rotates the destination register right
through the carry bit. Its byte version operates on
the low byte of the target. The SWPB instruction
swaps the high byte and the low byte of the
destination. Since it operates on a word only, the
B/W bit is always set to 0. The RRA instruction
performs an arithmetic right shift. The difference
between RRC and RRA is that the carry goes to the
MSB in RRC, but the MSB is duplicated in RRA to keep
the sign bit. In either case, the carry receives the
LSB after the rotation. The SXT instruction fills
the high byte with the sign bit (bit 7) for sign
extension. The result is a word, so the B/W bit for
SXT is always 0. The PUSH instruction stores the
source on the stack. It first decrements the stack
pointer by 2 and then writes the source to the
stack. Each item in the MSP430 stack occupies one
word (2 bytes); thus, the stack pointer is
decremented by 2. The stack grows downward in
MSP430. The CALL instruction calls a subroutine. It
first pushes the address of the next instruction
(the one after this call instruction) onto the
stack, and then loads the source (the starting
address of the subroutine) into the program counter
(PC). The PC always points to the next instruction
to be executed. The instruction RETI should be
placed at the end of an interrupt service routine.
It restores the status register (SR) and the return
address (PC) saved before the interrupt. When an
interrupt occurs, the MSP430 pushes the PC onto the
stack, and then pushes the SR. So control is given
back to the interrupted routine after the interrupt
service routine is done.
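The difference between RRC and RRA is easy to see in a small model. The Python sketch below follows the description above; the function names and the `byte` flag are assumptions for illustration, and the carry argument is expected to be 0 or 1.

```python
def rrc(value, carry, byte=False):
    """Rotate right through carry. Returns (result, new_carry)."""
    width = 8 if byte else 16
    value &= (1 << width) - 1
    new_carry = value & 1                            # LSB goes to the carry
    result = (value >> 1) | (carry << (width - 1))   # old carry enters the MSB
    return result, new_carry


def rra(value, byte=False):
    """Arithmetic shift right. Returns (result, new_carry)."""
    width = 8 if byte else 16
    value &= (1 << width) - 1
    msb = value & (1 << (width - 1))                 # sign bit is duplicated
    new_carry = value & 1                            # LSB goes to the carry
    return (value >> 1) | msb, new_carry


print([hex(x) for x in rrc(0x8001, 0)])   # MSB is filled from the carry
print([hex(x) for x in rra(0x8001)])      # MSB keeps the sign
```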


Table Type II Instruction Format in MSP430


Conditional Jumps


Conditional jumps are fundamental to program flow
control, such as if-then-else or loop structures.
Since they alter the PC if a specific condition is
met, most of the instruction bits are used for the
target address. Their formats are illustrated in the
table below. There are 8 jump instructions
implemented in MSP430, one of which (JMP) is
unconditional. The jump instruction itself does not
evaluate the “condition.” Instead, before a
conditional jump is executed, the CPU should have
set some flags in the status register (SR) according
to the previous instruction, and the jump tests
those flags. The status register contains 4
arithmetic status bits, a global interrupt enable,
and 4 bits that disable various clocks to enter
low-power mode. The status register table further
below illustrates the function of each individual
bit in the MSP430 status register.


Table Conditional Jump Instructions 


ID Code  Condition  10-Bit Signed Offset  PC = PC + 2 x Offset
001      000        Offset                JNE/JNZ  Jump if not equal/zero
001      001        Offset                JEQ/JZ   Jump if equal/zero
001      010        Offset                JNC/JLO  Jump if no carry/lower
001      011        Offset                JC/JHS   Jump if carry/higher or same
001      100        Offset                JN       Jump if negative
001      101        Offset                JGE      Jump if greater or equal
001      110        Offset                JL       Jump if less
001      111        Offset                JMP      Jump (unconditionally)
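The jump format above can be decoded with a few shifts and masks. The following Python sketch is illustrative only; the function name `decode_jump` and the list `NAMES` are assumptions. It takes `pc` to be the address of the instruction following the jump, which is the value the program counter holds when the jump executes.

```python
def decode_jump(word, pc):
    """Decode a 16-bit jump instruction.

    Returns (condition code, target address), where the target is
    pc + 2 x offset and the offset is a signed 10-bit number.
    """
    if (word >> 13) != 0b001:          # ID code in bits 15-13
        raise ValueError("not a jump instruction")
    condition = (word >> 10) & 0b111   # condition in bits 12-10
    offset = word & 0x3FF              # offset in bits 9-0
    if offset & 0x200:                 # bit 9 is the sign bit
        offset -= 0x400
    return condition, (pc + 2 * offset) & 0xFFFF


NAMES = ["JNE", "JEQ", "JNC", "JC", "JN", "JGE", "JL", "JMP"]
cond, target = decode_jump(0x3FFF, 0x8002)   # JMP with offset -1
print(NAMES[cond], hex(target))
```

Because the offset counts words rather than bytes, a 10-bit field covers roughly 1 KB in each direction around the jump.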


Most instructions affect the C, Z, N, and V bits
after being executed by the ALU of MSP430. The carry
bit C flags an arithmetic result that overflows the
register when the operands are treated as unsigned.
For example, the sum of 0xFFFF and 0x0001 should be
0x10000, but the register can only hold two bytes of
data. Therefore, the carry bit is set to flag this
overflow. In rotation instructions, the carry takes
part in the shifts. It receives bit 0 of the
register in right rotations.


Table Bits in the MSP430 Status Register 


Bit    15-9      8  7     6     5        4        3    2  1  0
Field  Reserved  V  SCG1  SCG0  OSC OFF  CPU OFF  GIE  N  Z  C


The zero flag Z is set whenever the result is zero 
after executing an instruction. A common use of this 
Z flag is to test if two operands are equal using a 
subtraction instruction. The combination of SUB and 
JNZ/JNE will implement the if-then structure. 


The negative flag N is set when the result is
negative after executing an instruction. It is set
to the MSB of the result. Like the Z flag, the N
flag can be used to test the ordering of two
operands. For example, if we want to execute an
instruction based on the condition a >= b, we would
first perform the subtraction a - b, followed by a
JN instruction. Note that the jump tests the negated
condition (a < b), because taking the jump skips the
instruction we want to be executed.


The signed overflow flag V is set when an overflow
occurs in signed operations. For example, the sum of
two positive numbers 0x7FFF and 0x0001 is 0x8000.
This operation does not cause a carry if the
operands are unsigned; thus, the C flag is not set.
However, if the two operands are considered signed
numbers, the result is negative, which is not right.
Thus, the V flag is set in this situation.
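The two worked examples (0xFFFF + 0x0001 and 0x7FFF + 0x0001) can be checked with a small model of a 16-bit adder. This Python sketch is an illustration, not a description of the MSP430 ALU internals; the function name `add16` is hypothetical.

```python
def add16(a, b):
    """Add two 16-bit values and return (result, flags)."""
    a &= 0xFFFF
    b &= 0xFFFF
    total = a + b
    result = total & 0xFFFF
    flags = {
        "C": int(total > 0xFFFF),      # carry out of bit 15
        "Z": int(result == 0),         # result is zero
        "N": (result >> 15) & 1,       # MSB of the result
        # signed overflow: both operands differ in sign from the result
        "V": int(((a ^ result) & (b ^ result) & 0x8000) != 0),
    }
    return result, flags


print(add16(0xFFFF, 0x0001))   # unsigned overflow: carry set
print(add16(0x7FFF, 0x0001))   # signed overflow: V set, no carry
```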


The global interrupt enable flag (GIE) controls the
maskable interrupts. A two-level interrupt control
mechanism is implemented in MSP430. Each maskable
interrupt can be enabled or disabled individually.
For a particular interrupt, e.g., the timer
interrupt, to be enabled, both its individual
control and GIE have to be enabled. Non-maskable
interrupts, which may not be masked by GIE, will be
described later.


Bits 4 (CPU OFF), 5 (OSC OFF), 6 (SCG0), and 7
(SCG1) are used for low power control. Setting these
bits puts the CPU into low power modes. The default,
with those bits cleared, is the fully functional
mode. More discussion about low power modes will be
provided later.


Example of an Instruction 


An instruction is composed of an opcode, operands,
and other modifiers. The opcode designates what the
operation is, and the operands specify what data to
operate upon. For example, moving data from R5 to R6
yields the instruction depicted in the table below.
The instruction is represented by “MOV.W R5, R6,” a
mnemonic notation. This notation is invented for
human beings to ease programming and understanding
code. Without it, it would be hard to read a bunch
of 0’s and 1’s everywhere, like this one:
“0100010100000110.” The opcode for MOV is 0100, the
Src register is 0101 (R5), the destination
addressing mode Ad is 0, the “.W” is represented by
setting bit 6 (B/W) to 0, the source addressing mode
(As) is 00, and the destination register is 0110
(R6).


Table The Instruction MOV.W R5, R6 


Bits   15-12   11-8  7   6    5-4  3-0
Field  Opcode  Src   Ad  B/W  As   Dst
Value  0100    0101  0   0    00   0110
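The field layout of this example can be reproduced with shifts and masks. The Python sketch below packs and unpacks the six fields of a two-operand instruction; the function names `encode_type1` and `decode_type1` are assumptions for illustration, and no range checking is done on the field values.

```python
def encode_type1(opcode, src, ad, bw, as_, dst):
    """Pack the six fields of a two-operand instruction into one word."""
    return ((opcode << 12) | (src << 8) | (ad << 7)
            | (bw << 6) | (as_ << 4) | dst)


def decode_type1(word):
    """Split a 16-bit word into its fields."""
    return {
        "opcode": (word >> 12) & 0xF,   # bits 15-12
        "src": (word >> 8) & 0xF,       # bits 11-8
        "ad": (word >> 7) & 0x1,        # bit 7
        "bw": (word >> 6) & 0x1,        # bit 6
        "as": (word >> 4) & 0x3,        # bits 5-4
        "dst": word & 0xF,              # bits 3-0
    }


word = encode_type1(0b0100, 5, 0, 0, 0b00, 6)   # MOV.W R5, R6
print(format(word, "016b"))
print(decode_type1(word))
```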


We sometimes refer to “0100010100000110” as a
machine instruction and to “MOV.W R5, R6” as an
assembly instruction. Machine code is composed of a
sequence of machine instructions, which are executed
in sequence. It is not difficult to see that there
is a one-to-one correspondence between assembly
instructions and machine instructions. The
translation from assembly instructions to machine
instructions is done by an assembler.


Emulated Instructions 


There are 12 Type I instructions, 7 Type II
instructions, and 8 jump instructions in MSP430, so
there are 27 native instructions in total. To ease
programming, a set of emulated instructions is
defined. For example, the emulated instruction EINT
enables global interrupts. To do that, we set the
GIE bit in the SR register, so EINT essentially
performs the “BIS #8, SR” instruction. Since we may
need to enable or disable (DINT) global interrupts
quite often when implementing critical sections,
these emulated instructions make programming neat
and simple. Moreover, in a loop implementation,
incrementing the loop variable by one or two may use
the emulated instructions “INC dst” or “INCD dst,”
which greatly improve program readability.
Therefore, an emulated instruction is essentially an
alias of a native instruction with some designated
operands. Functionality-wise, it is similar to a
macro in C language. In some texts they are also
called pseudo instructions. There are 23 emulated
instructions defined in MSP430, as listed in the
table below. These instructions are classified as
follows:


• SR manipulations: CLRC, CLRN, CLRZ, SETC, SETN,
SETZ, DINT, EINT

• Increment/Decrement by one or two: INC, INCD,
DEC, DECD

• PC control: BR, RET

• Stack manipulations: POP

• Left shifters: RLA, RLC

• Others: ADC, DADC, INV, NOP, SBC, TST


Table Emulated Instructions Defined in MSP430 


Emulated     Actual           Description
ADC.x dst    ADDC.x #0,dst    Add carry to destination
BR dst       MOV dst,PC       Branch to destination
CLRC         BIC #1,SR        Clear carry bit
CLRN         BIC #4,SR        Clear negative bit
CLRZ         BIC #2,SR        Clear zero bit
DADC.x dst   DADD.x #0,dst    Decimal add carry to destination
DEC.x dst    SUB.x #1,dst     Decrement
DECD.x dst   SUB.x #2,dst     Double decrement
DINT         BIC #8,SR        Disable interrupts
EINT         BIS #8,SR        Enable interrupts
INC.x dst    ADD.x #1,dst     Increment
INCD.x dst   ADD.x #2,dst     Double increment
INV.x dst    XOR.x #-1,dst    Invert
NOP          MOV #0,R3        No operation
POP dst      MOV @SP+,dst     Pop from stack
RET          MOV @SP+,PC      Return from subroutine
RLA.x dst    ADD.x dst,dst    Rotate left arithmetic (shift left 1 bit)
RLC.x dst    ADDC.x dst,dst   Rotate left through carry
SBC.x dst    SUBC.x #0,dst    Subtract borrow (1-carry) from destination
SETC         BIS #1,SR        Set carry bit
SETN         BIS #4,SR        Set negative bit
SETZ         BIS #2,SR        Set zero bit
TST.x dst    CMP.x #0,dst     Test destination

Functionality of Instructions 


Instructions may be classified by their
functionality. Generally, they can be categorized
into data movements, dyadic (binary) operations,
monadic (unary) operations, and flow controls. Data
can be moved from a register to another register,
from a register to a memory location, or from a
memory location to a register. Dyadic instructions
operate upon two operands; thus, they are also
called binary operations, such as arithmetic,
logical, and floating point instructions. Monadic
instructions operate on a single operand, such as
shift, rotation, and logical NOT instructions. Flow
control instructions alter the program execution
flow, such as subroutine calls, interrupts,
branches, and unconditional/conditional jumps.


Data Movement Operations 


Data movement instructions are in charge of moving 
data between memory and register, between register 
and register, or between memory and memory. Most 
RISC architectures require instructions to load data 
from memory to register, and to store data from 
register to memory, as their instructions are 
designed to operate upon registers. Stack 
manipulations such as push and pop belong to the 
data movement category because the stack is 
allocated in memory and the data to be stored or 
retrieved is in register. A typical assembly 
instruction for data movement will have the 
following format: 


Table Move Instruction Format 
MOV src, dst ;move data from src to dst 


In some assembly languages, the source and the
destination may be swapped. So be careful about the
operand order when working with the move
instruction. The default operation of the move
instruction is to move one word from the source to
the destination. Therefore, it is important to make
sure the width of the operands is a word. The
default instruction can also be written as MOV.W to
explicitly designate the operand width. If you just
need to move one byte, the operand width may be
specified by the instruction, as in MOV.B. Applying
MOV.B to word size operands will move the low byte,
but applying MOV.W to byte operands will cause
errors as there is not enough data to operate upon
(data width mismatch).


Move Data from Register to Register 


Almost all architectures provide instructions for
moving data from one register to another. Suppose a
register, say R4, stores a temporary result, but we
need to use R4 for something else. The value stored
in R4 will have to be moved somewhere else.
Otherwise, the temporary result will be erased
should R4 be used for other computations. In this
situation, for example, the move instruction shown
below will store R4’s value in R15. After that
statement, R4 is released and may be used for other
computations.


Table Move Data from Register to Register 


MOV R4, R15 ; store R4 in R15 


Set Values in Registers 


In some scenarios, we may need to initialize a
register with some value to start with. A typical
way of doing this is to set the register to a value
using the move instruction. The value is specified
by a pound sign (#) followed by a number. The number
may carry a radix suffix. In the IAR system, binary,
octal, decimal, and hexadecimal are designated by b,
o (or q), d, and h. For example, the following table
shows ways of setting a register to a value.


Table Setting a Register to a Value


MOV #00010010b, R4 ; set R4 to binary 00010010
MOV #1234o, R4 ; set R4 to octal 1234
MOV #1234d, R4 ; set R4 to decimal 1234
MOV #1234h, R4 ; set R4 to hexadecimal 1234
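To see what values these immediates denote, the radix suffixes can be interpreted with a short helper. This Python sketch is not part of any assembler; the name `parse_immediate` is hypothetical, and treating a number without a suffix as decimal is an assumption of the sketch.

```python
RADIX = {"b": 2, "o": 8, "q": 8, "d": 10, "h": 16}


def parse_immediate(text):
    """Convert an immediate such as '#1234h' to an integer."""
    digits = text.lstrip("#")
    suffix = digits[-1].lower()
    if suffix in RADIX:
        return int(digits[:-1], RADIX[suffix])
    return int(digits, 10)    # assumption: no suffix means decimal


for imm in ["#00010010b", "#1234o", "#1234d", "#1234h"]:
    print(imm, "=", parse_immediate(imm))
```

The same digits 1234 denote three different values depending on the suffix, which is why the suffix should not be omitted when the radix matters.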


To set a register to a common value such as -1, 0,
1, 2, 4, or 8, we may use a special register R3,
called the constant generator in MSP430. This will
be discussed in a later section.


Stack Operation 


The stack is an important data structure in computer
systems, and its operations follow the
first-in-last-out pattern. A stack pointer, a
special register in the CPU, holds the address of
the top element in the stack. Therefore, the push
and pop instructions need not be supplied with the
address of a particular element in the stack.


Table Stack Operations 


PUSH dst ; push data onto stack
POP dst ; pop from stack
; MOV @SP+, dst


Assembly instructions that operate on the stack are
illustrated in the table above. The push instruction
places the data in the stack at the slot pointed to
by the stack pointer (SP), and adjusts the stack
pointer to the next available slot. The reverse
operation retrieves the top entry in the stack and
adjusts the stack pointer accordingly. Note that in
MSP430, the pop instruction is emulated by the move
instruction.
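
The push and pop behavior can be sketched in Python as
follows. This is a simplified model under stated
assumptions, not MSP430 code: the class name Stack16
and the starting address 0x0400 are hypothetical, the
stack grows toward lower addresses, and push
decrements SP by two before storing, which is how
MSP430 orders the two steps.

```python
class Stack16:
    """Toy model of a word-wide stack that grows downward."""

    def __init__(self, top_address=0x0400):
        # Assumption: 0x0400 is an arbitrary initial stack top.
        self.memory = {}
        self.sp = top_address

    def push(self, value):
        self.sp -= 2                      # make room for one word
        self.memory[self.sp] = value & 0xFFFF

    def pop(self):
        value = self.memory[self.sp]      # like MOV @SP+, dst
        self.sp += 2
        return value


if __name__ == "__main__":
    stack = Stack16()
    stack.push(0x1234)
    stack.push(0xABCD)
    print(hex(stack.pop()), hex(stack.pop()), hex(stack.sp))
```

Neither operation takes an address as an argument; the
stack pointer alone decides which slot is used.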


Direct Memory Access 


Since memory access is about an order of magnitude
slower than register access, there is sometimes a
need to move multiple words at a time. To free the
CPU from waiting for long memory accesses, a direct
memory access (DMA) mechanism is designed to move
data on the side. This requires a DMA controller
that takes charge of the data movement without
involving the CPU, so the CPU may continue its
computations. To begin a DMA transfer, we have to
provide information about the data, such as where
to begin and the number of words (or bytes). Once
started, the DMA controller moves the data
accordingly until the task is finished, and then
notifies the CPU with an interrupt to report the
status of the data movement. The use of DMA
improves system performance and reduces power
consumption in low-power designs. For example, in
MSP430, moving data from ADC12 to RAM with DMA
allows the CPU to remain in sleep mode.


Most instructions will modify the status register, but 
the movement instructions will not affect it. 


Move Data from Memory to Memory 


Some CPUs do not provide a direct memory copy
from one location to another. In MSP430, indexed
addressing on both the source and destination
operands allows moving data from one memory
location to another. The following table illustrates
how to move data from memory to memory in MSP430.
R4 holds the source address, the location in memory
that contains the data to be copied, and R5 holds
the destination address.


Table Move Data from Memory to Memory 


MOV 0(R4), 0(R5)
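
The effect of indexed addressing on both operands can
be modeled with a short Python sketch. The dictionaries
standing in for the register file and the memory, and
the function names effective_address and mov_indexed,
are assumptions made for illustration only.

```python
def effective_address(registers, base, offset=0):
    """Address used by an operand written as offset(base)."""
    return (registers[base] + offset) & 0xFFFF


def mov_indexed(registers, memory, src_base, dst_base,
                src_offset=0, dst_offset=0):
    """Model of MOV src_offset(src_base), dst_offset(dst_base)."""
    source = effective_address(registers, src_base, src_offset)
    destination = effective_address(registers, dst_base, dst_offset)
    memory[destination] = memory[source]   # the source keeps its value


if __name__ == "__main__":
    registers = {"R4": 0x0200, "R5": 0x0210}   # hypothetical addresses
    memory = {0x0200: 0x1234, 0x0210: 0x0000}
    mov_indexed(registers, memory, "R4", "R5")
    print(hex(memory[0x0210]))
```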


Move Data between Memory and 
Registers 


Most CPUs provide instructions to move data from
memory to a register, or vice versa. The typical
mnemonics are load and store. In MSP430, this data
movement is done with indexed addressing on one of
the operands. Indexed addressing provides a
mechanism to indicate an address in memory. For
example, the following table shows how to move data
between memory and a register.


Table Data Movement Between Memory and 
Register 


MOV R4, 0(R5) ; move data from R4 to the memory
; address stored in R5

MOV 0(R4), R5 ; move data from the memory
; address stored in R4 to R5


Dyadic Arithmetic Operations 


Instructions that operate on two operands are called
dyadic or binary operations. Dyadic operations
include arithmetic, logic, and floating point
instructions. The following table lists the dyadic
arithmetic instructions in MSP430. Note that an
asterisk prefixed to an instruction indicates an
emulated instruction.


Table Dyadic Arithmetic Operations 


add.w src, dst addition, dst += src
addc.w src, dst add with carry, dst += (src + C)
*adc.w dst add carry, dst += C
sub.w src, dst subtraction, dst -= src
subc.w src, dst sub with borrow, dst -= (src + ~C)
*sbc.w dst sub borrow, dst -= ~C
cmp.w src, dst compare, set flags (dst - src)
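
One common use of the add-with-carry instruction is
arithmetic on values wider than a word. The Python
sketch below models a 32-bit addition built from a
16-bit add followed by a 16-bit add with carry. The
function names are hypothetical, and the inputs are
assumed to be unsigned values below 2^32.

```python
MASK16 = 0xFFFF


def add_w(src, dst):
    """Model of add.w: return (16-bit result, carry out)."""
    total = dst + src
    return total & MASK16, total >> 16


def addc_w(src, dst, carry):
    """Model of addc.w: the incoming carry is added as well."""
    total = dst + src + carry
    return total & MASK16, total >> 16


def add32(a, b):
    """Add two 32-bit values one word at a time."""
    low, carry = add_w(a & MASK16, b & MASK16)
    high, carry = addc_w(a >> 16, b >> 16, carry)
    return (high << 16) | low, carry


if __name__ == "__main__":
    result, carry = add32(0x0001FFFF, 0x00000001)
    print(hex(result), carry)
```

The carry produced by the low words is what lets the
high words see the overflow from the lower half.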


Dyadic Logical Instructions 


Logical AND and XOR instructions are provided in
MSP430. There are no explicit OR and NOT
instructions. The logical OR operation can be done
using the bit set “bis” instruction, although it does
not affect the SR register. Other than that, the bis
instruction may be a substitute for logical OR. The
NOT operation may be implemented by XOR’ing
0xFFFF with a destination. For example, the
statement XOR #0FFFFh, dst inverts each bit in the
dst. In fact, the emulated INV.W performs the above
XOR instruction for the logical NOT operation. The
following table lists the dyadic logical
instructions in MSP430. Note again that the bis and
bic instructions do not affect the SR register.


Table Dyadic Logical Instructions 


and.w src, dst bitwise and, dst &= src
xor.w src, dst bitwise xor, dst ^= src
bit.w src, dst bitwise test (dst & src)
bis.w src, dst bit set, dst |= src
bic.w src, dst bit clear, dst &= ~src


The first three instructions in the table above
affect the SR register in the normal way. The Z bit
is set if the result is zero. In MSP430, the carry
bit is set to the negation of the Z bit for these
instructions. The following examples show how the SR
is changed by these operations. In (a), the Z flag is
set because the result is zero after the INV (XOR)
instruction is executed. The V flag is also set
because the value stored in R4 is changed from -1
(0xFFFF) to 0. Since 0 is considered positive, the
sign of the value is changed. The example in (b)
shows that the flags V and C are set after the
execution of the INV instruction. In this case, the
sign of the value stored in R4 is changed as well.
Since the result (0x00FF) is not zero, the Z flag is
not set, so the carry is set because C = ~Z for the
logical instructions. Here the carry bit does not
have its usual meaning. Instead, it simply indicates
that the result is not zero.


Table Examples for the SR Change for Logical 
Instructions 


(a)
mov.w #0xffff, R4 ; set R4 to 0xffff, SR
(b)
mov.w #0xff00, R4 ; set R4 to 0xff00, SR
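
The flag changes in examples (a) and (b) can be
reproduced with a small Python model of the XOR
instruction. This is a sketch only: the function names
xor_w and inv_w are hypothetical, C is taken as the
negation of Z as stated in the text, and V is assumed
to be set when both operands are negative, which is
the rule TI documents for XOR.

```python
def xor_w(src, dst):
    """Model of xor.w: return (result, flags)."""
    result = (src ^ dst) & 0xFFFF
    flags = {
        "N": int((result & 0x8000) != 0),   # result is negative
        "Z": int(result == 0),              # result is zero
        "C": int(result != 0),              # C = ~Z for logical ops
        # Assumption: V is set when both operands are negative.
        "V": int((src & 0x8000) != 0 and (dst & 0x8000) != 0),
    }
    return result, flags


def inv_w(dst):
    """inv.w is emulated as XOR #0FFFFh, dst."""
    return xor_w(0xFFFF, dst)


if __name__ == "__main__":
    print(inv_w(0xFFFF))   # example (a)
    print(inv_w(0xFF00))   # example (b)
```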


Monadic Operations 


Operations that require only one operand are called
monadic or unary operations. Typical instructions
include shift and rotate instructions. The following
table lists the unary instructions in MSP430. Note
that the instruction tst.w is a special case of the
instruction cmp.


Table Unary Instructions in MSP430 


*clr.w dst clear, dst = 0
*dec.w dst decrement, dst -= 1
*decd.w dst double decrement, dst -= 2
*inc.w dst increment, dst += 1
*incd.w dst double increment, dst += 2
*tst.w dst test with 0 (dst - 0)


Decimal Operations 


MSP430 provides one native instruction to handle
binary-coded decimal (BCD) operands. For example,
the operation 9 + 1 = 10 can be computed as depicted
in the following table. Note that the sum is encoded
as 0x10, not the hexadecimal 0xA.


Table BCD Additions 


mov.w #0x9, R4 ; R4 keeps 0x9
dadd.w #0x1, R4


There is also an emulated instruction that adds the
carry to a BCD number. The following table shows the
two BCD instructions.


Table BCD Instructions in MSP430


dadd.w src, dst decimal add with carry
*dadc.w dst decimal add carry, dst += C
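
How a decimal addition differs from a binary one can
be seen in the following Python sketch, which adds two
four-digit BCD words digit by digit. It is a model of
the idea behind dadd.w, not of the hardware itself;
the function name dadd_w is hypothetical and both
operands are assumed to hold valid BCD digits (0 to 9
in every nibble).

```python
def dadd_w(src, dst, carry=0):
    """Add two 16-bit BCD values; return (result, carry out)."""
    result = 0
    for shift in (0, 4, 8, 12):           # one decimal digit per nibble
        digit = ((src >> shift) & 0xF) + ((dst >> shift) & 0xF) + carry
        carry = int(digit > 9)
        if carry:
            digit -= 10                   # keep the digit in 0..9
        result |= digit << shift
    return result, carry


if __name__ == "__main__":
    total, carry = dadd_w(0x1, 0x9)
    print(hex(total), carry)              # 9 + 1 in BCD
```

A plain binary addition of the same operands would
give 0xA, which is not a valid BCD digit.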


Byte Manipulations 


Most instructions in MSP430 have byte versions. For
example, instead of moving a word (two bytes) to the
destination, “MOV.B” moves just the low byte.
Additionally, there are two instructions designed to
manipulate bytes in a word. The following table lists
the byte manipulating instructions. The instruction
swpb exchanges the lower byte and the upper byte of
the operand. The instruction sxt performs sign
extension.


Table Byte Manipulation Instructions 


swpb src ; swap upper and lower bytes
sxt src ; extend sign of lower byte


Examples of the byte manipulating instructions are
illustrated in the following table. The swpb
instruction is straightforward: it simply swaps the
lower byte and the upper byte of the register R4, as
shown in (a). Sign extension is needed to promote
byte data to a word. In the example depicted in (b),
bit 7 of R4 is 1, which indicates that the lower
byte is a negative value. The sxt instruction fills
the upper byte of R4 with ones to preserve the sign
after promotion. In this example, the upper byte is
overwritten with the value 0xFF.


Table Examples of Swap and Sign Extension 
Instructions 


(a)
mov.w #0xabcd, R4 ; R4 is 0xabcd
swpb R4 ; R4 is 0xcdab
(b)
mov.w #0xab8d, R4 ; R4 is 0xab8d
sxt R4 ; R4 is 0xff8d
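
The two byte manipulations can be written out in a few
lines of Python. The functions swpb and sxt below are
models that only borrow the instruction names; they
operate on plain 16-bit integers rather than on
registers.

```python
def swpb(value):
    """Exchange the upper and lower bytes of a 16-bit word."""
    return ((value << 8) | (value >> 8)) & 0xFFFF


def sxt(value):
    """Sign-extend the lower byte into the upper byte."""
    low = value & 0xFF
    if low & 0x80:            # bit 7 set: the byte is negative
        return low | 0xFF00
    return low                # bit 7 clear: upper byte becomes 0x00


if __name__ == "__main__":
    print(hex(swpb(0xABCD)))  # example (a)
    print(hex(sxt(0xAB8D)))   # example (b)
```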


Bit Operations on SR 


The instructions that operate upon individual bits
of the SR register are listed in the following table.
Since there are no native bit manipulating
instructions, these are all emulated using logical
instructions. The carry bit is involved in
instructions such as adc, addc, sbc, subc, dadc,
dadd, rlc, and rrc. Therefore, the clrc and setc
instructions normally work along with those
instructions. For example, to shift an operand right
by one bit, a clrc is used to clear the carry bit,
followed by an rrc instruction. If the carry bit is
not cleared, it goes into the MSB of the operand,
which may produce an unwanted outcome.


Table Bit Operations on SR 


*clrc clear carry bit, C = 0
*clrz clear zero bit, Z = 0
*clrn clear negative bit, N = 0
*dint disable general interrupt, GIE = 0
*setc set carry bit, C = 1
*setz set zero bit, Z = 1
*setn set negative bit, N = 1
*eint enable general interrupt, GIE = 1
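
Why the carry bit should be cleared before a rotate
can be shown with a short Python model of rrc. The
function names are hypothetical; the model assumes the
usual rotate-through-carry behavior, in which the old
carry enters bit 15 and bit 0 becomes the new carry.

```python
def rrc_w(value, carry):
    """Rotate a 16-bit word right through carry.

    Returns (result, new carry).
    """
    new_carry = value & 1
    result = ((carry << 15) | (value >> 1)) & 0xFFFF
    return result, new_carry


def logical_shift_right(value):
    """clrc followed by rrc: a zero is shifted into the MSB."""
    return rrc_w(value, 0)


if __name__ == "__main__":
    print([hex(x) for x in logical_shift_right(0x0004)])
    print([hex(x) for x in rrc_w(0x0004, 1)])   # stale carry set
```

With a stale carry of 1 the result has its MSB set,
which is the unwanted outcome mentioned above.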


Flow of Control 


Subroutine calls, interrupts, branches, and jumps
alter the program’s flow of execution. At the CPU
level, the program counter (PC) keeps the address of
the next instruction to be executed. Therefore, a
change in the PC alters which instruction is
executed next. Subroutine calls, branches, and jumps
essentially modify the content of the PC
accordingly.


Assembly Programming - Part 1 

This chapter introduces the MSP430 assembly
programming language, including instruction formats,
addressing modes, subroutine call and return
mechanisms (cross-reference PL/Language
Translation and Execution), I/O and interrupts, and
heap vs. static vs. stack vs. code segments. Explain
the organization of the classical von Neumann 
machine and its major functional units. Describe 
how an instruction is executed in a classical von 
Neumann machine, with extensions for threads, 
multiprocessor synchronization, and SIMD 
execution. Summarize how instructions are 
represented at both the machine level and in the 
context of a symbolic assembler. Demonstrate how 
to map between high-level language patterns into 
assembly/machine language notations. Explain 
different instruction formats, such as addresses per 
instruction and variable length vs. fixed length 
formats. Explain how subroutine calls are handled at 
the assembly level. Explain the basic concepts of 
interrupts and I/O operations. Write simple 
assembly language program segments. Show how 
fundamental high-level programming constructs are 
implemented at the machine-language level. 


MSP430 Assembly Programming 


1. System Organization 


This section discusses the basic components of a
computer system, such as the CPU, memory, I/O, and
the buses that connect them. Understanding the
system organization is key to developing a
high-performance system. For example, data stored in
registers can be retrieved much faster than data
stored in memory. Therefore, a program that keeps
frequently accessed data in registers runs much
faster than one that keeps them in memory.


1. Basic System Components 


A typical computer system is composed of a CPU,
memory, I/O, and a bus. The bus connects the other
components together. Most of the computations happen
in the CPU. Programs and their data are stored in
memory. Input/output devices, such as disk drives,
keyboards, and network cards, are used for data
communication between the computer system and the
outside world. Figure 1 illustrates a basic computer
system. This type of computer system is called a von
Neumann machine, and most computer systems nowadays
follow this architecture. Note that the memory is
volatile, i.e., it keeps data while power is
supplied but loses the data when the power is off.
Therefore, we need to keep the information via the
I/O system, such as a hard drive for data storage.


Figure 1 A Basic Computer System


1. System Buses (data bus, address bus, control 
bus) 


Components in the computer system are interconnected
via system buses. A bus is a set of wires over which
electronic signals travel. System buses include the
data bus, the address bus, and the control bus.


Data buses are used to transfer data from one
component to another in a computer system. The
width (number of wires, or number of bits) of the
data bus varies from CPU to CPU. Typically, the
wider the bus, the higher the bandwidth at which
data may be transferred. The bus width decides the
number of bits that can be transferred from one
component to another in a transaction. Normally, we
use the width of the data bus to classify the “size”
of a CPU. For example, a 16-bit CPU means its data
bus is 16 bits wide. However, the actual “size” of a
CPU should be defined by its processing power in
terms of the size of the operands in an instruction.
For example, a 16-bit CPU adds two 16-bit numbers in
its ADD instruction.


The address bus is used to indicate a specific
location in a component, e.g., memory, where the data
will be accessed. The width of the address bus
specifies the range of the address space. An 8-bit
address bus designates 256 different locations, from
0 to 255, whereas a 16-bit address bus specifies a
much wider range, from 0 to 65535. The number of
locations is limited by the width of the address bus.
The memory capacity, for example, is the number of
locations multiplied by the size of each location. If
the size of a location is one byte, the memory is
called byte addressable. If the size of a location is
a word (assume a word is four bytes in the system),
it is called word addressable. In a byte addressable
memory module with an 8-bit address bus, the capacity
of the memory is 256 bytes. With the same address bus
but word addressing, the capacity is 1024 bytes
because each location corresponds to one word, i.e.,
four bytes. Clearly, a word addressable system can
address a higher capacity memory module. Given that
memory transactions are slow compared to register
accesses, a word addressable system also gets four
bytes of data in one memory transaction, which is
four times more than a transaction in a byte
addressable system.
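
The capacity calculation above amounts to one line of
arithmetic, shown here as a Python sketch. The function
name memory_capacity_bytes is hypothetical, and a word
is assumed to be four bytes, as in the text.

```python
def memory_capacity_bytes(address_bits, bytes_per_location=1):
    """Number of locations times the size of each location."""
    locations = 2 ** address_bits
    return locations * bytes_per_location


if __name__ == "__main__":
    print(memory_capacity_bytes(8))       # byte addressable, 8-bit bus
    print(memory_capacity_bytes(8, 4))    # word addressable, 8-bit bus
    print(memory_capacity_bytes(16))      # byte addressable, 16-bit bus
```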


The control bus is in charge of sending control
signals that coordinate all components during
computation. Control signals are derived from
instructions. For example, a multiplexer in the CPU
is used to select an operand either from a register
or from the immediate value that comes with an
instruction. The control signal to the multiplexer is
generated based on which instruction is in
execution. If, for example, an ADD instruction is
executed, the control signal is generated to select
a register operand. Should ADDI be executed, the
control signal is generated to select the immediate
value from the instruction. Moreover, in the
write-back stage, a suitable control signal should be
generated for either a write-enable to memory or a
write-enable to a register. Normally, the control
signals are generated by a control unit, which
contains a decoder for instructions. The decoder
identifies the instruction type, and the control
unit generates signals to direct all components to
perform the instruction in execution.
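
The ADD versus ADDI example can be sketched in Python
as a control unit that drives a two-input multiplexer.
Everything here is a simplified illustration: the
function names and the encoding of the select signal
(0 for the register, 1 for the immediate value) are
assumptions, not taken from any particular CPU.

```python
def control_unit(mnemonic):
    """Return the multiplexer select signal for an instruction."""
    # Assumption: 0 selects the register, 1 the immediate value.
    select_by_mnemonic = {"ADD": 0, "ADDI": 1}
    return select_by_mnemonic[mnemonic]


def operand_mux(select, register_value, immediate_value):
    """Two-input multiplexer in front of the ALU."""
    return immediate_value if select else register_value


def execute(mnemonic, accumulator, register_value, immediate_value):
    """Add the selected operand to the accumulator."""
    select = control_unit(mnemonic)
    return accumulator + operand_mux(select, register_value,
                                     immediate_value)


if __name__ == "__main__":
    print(execute("ADD", 10, 3, 7))    # uses the register operand
    print(execute("ADDI", 10, 3, 7))   # uses the immediate operand
```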


1. Memory Subsystem 


Memory is used to keep data for a program in
execution, i.e., a process. The data include the
program’s machine code, global data, the heap, and
the stack. For convenient management, we may divide
memory into segments, one for each type of data.
They are called the code segment, data segment, heap
segment, and stack segment. The code segment
contains the instructions of the program, which are
executed accordingly. The data segment contains
global data, which are used by the whole program.
The heap segment is an area that is dynamically
allocated by the program. The stack segment contains
local variables, parameters for procedures and
functions, and activation records that keep track of
subroutine calls.


There are two memory operations: read and write. In
a read operation, the CPU has to provide an address
to the memory, along with other control signals if
any. The content of the memory at that location is
then read out. Figure 2 illustrates a memory read by
the CPU. After the CPU provides the address, the
memory sends the data at that location back to the
CPU. A memory read may not require extra signals
other than the memory address.


Figure 2 A Memory Read from Memory to CPU


Similarly, in a write operation, the CPU has to
provide an address, the data to be written, and
control signals if any, such as write enable. The
provided data are stored in the designated location
of the memory. Figure 3 illustrates a typical memory
write. Most memory systems require a write enable
(WE) signal to trigger a memory write. The actual
write occurs on a clock event, either a rising edge
or a falling edge.


Figure 3 A Memory Write from CPU


The size of the data written or read depends on the
memory design. In a byte addressable system, each
read or write transfers one byte of data. In a word
addressable system, each read or write transfers
four bytes, so access to any individual byte within
the word has to be performed by other instructions.
For example, a logical AND instruction retrieves the
second byte (counting from zero) of a word as
follows:


AND R1, R2, ff0000h ; The second byte will be kept
; and all other bytes will be
; zeroed out.
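
The masking step can be checked with a short Python
sketch. The helper name extract_byte is hypothetical,
a word is assumed to be four bytes, and bytes are
counted from zero starting at the least significant
byte, so byte 2 corresponds to the mask 0x00FF0000.

```python
def extract_byte(word, index):
    """Return byte `index` (0 = least significant) of a 32-bit word."""
    mask = 0xFF << (8 * index)
    return (word & mask) >> (8 * index)


if __name__ == "__main__":
    word = 0x11223344
    print(hex(word & 0x00FF0000))     # AND alone: other bytes zeroed
    print(hex(extract_byte(word, 2))) # AND plus shift: the byte value
```

The AND by itself leaves the byte in its original
position; a shift is still needed to obtain its value
as a small number.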


1. Read-only Memory (ROM) 


Data stored in a code segment are typically accessed
by the CPU to fetch instructions for execution. Only
in very rare cases is the program code modified.
Therefore, to avoid altering the code, the memory is
marked as read only. That means the data stored in
read-only memory (ROM) can only be read out and
cannot be changed. Devices used to implement ROM
include erasable programmable read-only memory
(EPROM) and electrically erasable programmable
read-only memory (EEPROM). The initial data may be
burned into an EPROM by supplying a higher voltage
than normal to its inner floating gate transistors.
An EPROM retains data even when its power is
switched off, and it may be reprogrammed after being
exposed to strong ultraviolet light. EEPROM shares
the same characteristics as EPROM except that
EEPROM uses electricity to program and erase its
data. Flash memory, as used in USB memory drives, was
developed from EEPROM. These types of memory are
used to keep static data such as device
configuration information and program code. They are
non-volatile.


Microcontrollers such as the Intel 8051 and the TI
MSP430 integrate EEPROM on the chip. Normally, the
size of the ROM ranges from several kilobytes to
several hundred kilobytes. The original 8051 chip is
equipped with 4 KB of ROM for program memory. The
MSP430F2274 comes with 32 KB of ROM for program
memory. In the 8051 chip, what if a program exceeds
4 KB? In that case, external memory modules have to
be added. Fortunately, the 8051 has a 16-bit address
bus, so it can address up to 64 KB. MSP430 has a
16-bit address bus as well. Some MSP430 devices
extend the address bus to 20 bits and allow
additional ROM space above 0x10000.


1. Random Access Memory (RAM) 


Random access memory (RAM) is a type of computer
memory that is used to store process data. The
worst-case time to access any arbitrary data in RAM
is bounded by a constant. Data stored in RAM are
lost when the power supply is switched off. By and
large, there are two types of RAM: static RAM (SRAM)
and dynamic RAM (DRAM). DRAM employs a capacitor to
store one bit of information. Logic high is
represented by a charged capacitor, whereas logic
low is represented by a discharged capacitor. The
capacitors are small and may be built in large
numbers in an integrated circuit. However, the
charge of a capacitor leaks gradually, so DRAM has
to refresh the capacitor charge periodically to keep
the bit information. This is the reason why
“dynamic” is used in its name. SRAM, on the other
hand, stores one bit of information in a bi-stable
latch, which does not need to be refreshed. The latch
requires more silicon area on the chip, and thus the
capacity of SRAM is typically smaller than that of
DRAM, but SRAM is faster.


Modern microcontrollers have on-chip RAM, i.e., RAM
and CPU integrated in a single chip. The capacity of
the integrated RAM ranges from several hundred bytes
to several kilobytes. The Intel 8051, for example,
comes with 128 bytes of internal RAM. The MSP430
family has up to 8 KB of RAM. Should a program
require more RAM, external memory modules have to be
added. The address bus in the 8051 is 16 bits wide,
so the maximal addressable memory space is 64 KB.
The memory address bus (MAB) in MSP430 is also 16
bits wide. Therefore, its addressable range is the
same as that of the Intel 8051.


1. Input/Output Subsystem 


Input/Output (I/O) is one of the required
components in a computer system. Its purpose is to
bring data to be processed into the system and to
transmit processed data out of the system. Consider
this scenario: a program asks for a choice from a
menu before continuing its execution. The system
first outputs a menu and prompts with a message,
say, “please enter your choice.” The operator
presses a key corresponding to the choice. In this
scenario, the output data from the system are the
menu and the prompt, and the input is the choice in
the form of a key. Input devices in a system include
the keyboard, mouse, DVD-ROM, microphone, webcam,
etc. Output devices include the monitor, speaker,
etc. Some devices perform both input and output
functions, such as the hard drive, floppy diskette,
memory, network interface card, etc.


Like the memory subsystem, each I/O device is
assigned an address, which typically is the starting
address of its allocated address space. The address
space may cover, e.g., control registers, data
registers, and status registers in the I/O device.
The CPU accesses I/O devices as if they were memory.
There may be several address spaces in the system:
one for RAM, one for ROM, one for I/O, etc. A Harvard
architecture has several memory spaces, each of
which starts from zero. A non-Harvard architecture,
on the other hand, has a single memory space that
holds all the allocated address spaces. The Intel
8051 follows the Harvard architecture. It has an
internal data memory space, a program code memory
space, and an external data memory space, all of
which start from zero. MSP430 adopts the non-Harvard
architecture, i.e., there is a single memory space
starting from zero. The advantage of the non-Harvard
architecture is that each address in the system
uniquely identifies a location, either a memory
location or an I/O device. In a Harvard
architecture, an address is not unique. Therefore,
there must be a mechanism to determine which device
is being referred to. In the Intel 8051, specially
designed instructions are used to designate a
specific device. For example, MOVX (move eXternal)
is used to transfer data from external memory to the
CPU, whereas MOV is used to get data from the
internal data memory.


1. System Timing 


A CPU is a complex sequential logic circuit, which
means some of its components involve registers or
flip-flops. Sequential logic requires a clock event
to perform its function. A clock is a signal that
alternates continuously between 0 and 1. There are
two clock events: the rising edge and the falling
edge. Figure 4 illustrates a clock with its rising
edge, falling edge, and period. The rising edge is
the event when the clock changes its value from
logic low to logic high. Similarly, when the clock
changes its value from logic high to logic low, it
is a falling edge event. The clock period measures
the time of one clock cycle. The reciprocal of the
clock period is the clock frequency. The higher the
frequency, the shorter the period of the clock.
Sequential logic may be designed to change its state
on either event.


Figure 4 A Clock


A CPU also has a large portion of combinational
logic, such as address decoding, the ALU, and the
like. Combinational logic does not involve a clock,
but it requires some time to perform its function.
The time is measured by the longest propagation
delay from its input to its output. The propagation
delay is the accumulated gate delay along the
critical path of the combinational logic circuit.
There is a relation between the propagation delay
and the clock frequency. A rule of thumb is that the
clock period (determined by the frequency) should be
slightly longer than the propagation delay, as
depicted in the following example. Recall that in a
memory write operation, the CPU has to provide three
pieces of information: address, data, and WE. Assume
that the memory is designed to update its content on
a rising edge. Figure 5 illustrates a timing diagram
for a memory write operation. The address, data, and
WE signals have to be stable by the coming rising
edge, at which the memory update occurs. By stable,
we mean that the device that generates these signals
should not be in a transient state. Otherwise, the
result is non-deterministic. In other words, the
longest propagation delay for generating address,
data, or WE should be shorter than the clock period.
Furthermore, the clock period may be determined by
the longest propagation delay in a system.
Figure 5 A Timing Example of a Memory Write
Operation


The higher the clock frequency, the better.
Therefore, a design concern is to shorten the
propagation delay. However, a long delay is
inevitable in some components, such as floating
point adders. A typical solution is to divide the
design into several sequential stages and link them
with registers that hold temporary results. This is
called pipelining. With this approach, the system
clock frequency may be maximized, which typically
yields better performance.
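
The effect of pipelining on the clock period can be
estimated with a small Python sketch. The delay values
are made-up numbers in nanoseconds, the function names
are hypothetical, and the register overhead added to
each stage is an assumption used to show that the
pipeline registers themselves cost some time.

```python
def min_clock_period_ns(path_delays_ns):
    """The clock period is bounded by the slowest path."""
    return max(path_delays_ns)


def pipelined_clock_period_ns(stage_delays_ns, register_overhead_ns=0):
    """After pipelining, the slowest stage sets the period."""
    return max(stage_delays_ns) + register_overhead_ns


if __name__ == "__main__":
    # Hypothetical design: a 40 ns adder dominates the other paths.
    print(min_clock_period_ns([12, 40, 25]))
    # The same adder split into three stages, 1 ns per register.
    print(pipelined_clock_period_ns([15, 15, 10], 1))
```

In this made-up case the period drops from 40 ns to
16 ns, although each result now needs three clock
cycles to pass through the adder.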


1. Registers 


Registers are data storage built inside a CPU,
unlike memory, which is typically outside the CPU
chip. Registers are used to keep data for
computation, including operands and results.
Accesses to registers are very fast and typically
complete within one clock cycle. Registers are built
from flip-flops and require a clock to operate.
Normally, registers are used to keep the variables
of a program. The instructions of the program then
operate on the variables, i.e., the registers.


Registers in a CPU may be classified into general
purpose registers (GPR) and special function
registers (SFR). General purpose registers keep data
for computation and its results. Special function
registers include the program counter (PC), status
register (SR), stack pointer (SP), and others. The
PC keeps the address of the next instruction to be
executed. The SR stores a number of flags that are
updated after an instruction is executed, such as
the carry bit (C), the zero bit (Z), and others. If
the result of an executed instruction is zero, the
zero bit in the SR is set. The stack pointer stores
the address of the top element of the stack in a
system.


In MSP430, the registers include 16 SFRs as well as
R0 (PC), R1 (SP), R2 (SR), and R3 (CG); R4-R15 are
general purpose registers. Each of the registers R0
to R15 is 16 bits wide. R0 is the program counter, R1
is the stack pointer, R2 is the status register and
also a constant generator, and R3 is a constant
generator. Both R2 and R3 may be used to generate
small constants such as 0, 1, 2, 4, 8, etc. All of
the twelve registers from R4 to R15 can be used as
data registers, address pointers, or index values,
and can be accessed with byte or word instructions.
The 16 SFRs are allocated in the lowest 16 bytes
(00h-0Fh) of the memory space. For example, I/O
ports are associated with SFRs. If we want to output
data via an I/O port, we may just write data to the
associated SFR. Similarly, if we need to get data
from outside, we may just read from the SFR
associated with the input port.


1. MSP430 Instruction Set 


The instruction set of a CPU dictates what the CPU
can do in a machine cycle. A program written in any
programming language is eventually converted to
machine code based on the instruction set. Machine
code, or op code, refers to the code a CPU can
execute. It contains information about which
operation is to be executed on which operands. In
terms of architecture, there are basically two
models: CISC (complex instruction set computer) and
RISC (reduced instruction set computer). CISC
implements op codes with a variable length, which
requires complex instruction decoding. The x86
architecture, used by Intel and AMD, belongs to CISC.
RISC, on the other hand, implements fixed-length
instructions, and thus its instruction decoding is
simple. Manufacturers such as IBM, Apple, and Sun
produce RISC CPUs. Later CPU designs such as MSP430
are somewhere in between CISC and RISC. There are
only 27 instructions in MSP430: 12 type 1
instructions, 7 type 2 instructions, and 8 jump
instructions.


Type 1 instructions:


MOV - move source to destination
ADD - add source to destination
ADDC - add source and carry to destination
SUBC - subtract source from destination with carry
SUB - subtract source from destination
CMP - compare (pretend to subtract) source from
destination
DADD - decimal add to destination with carry
BIT - test bits of source AND destination
BIC - bit clear (destination &= ~source)
BIS - bit set (logical OR)
XOR - exclusive OR source with destination
AND - logical AND source with destination


Type 2 instructions:


RRC - rotate right one bit through carry
SWPB - swap bytes
RRA - rotate right one bit arithmetically
SXT - sign extend byte to word (2 bytes)
PUSH - push value onto stack
CALL - subroutine call; push PC and move source to
PC
RETI - return from interrupt; pop SR, then pop PC


Jump instructions:


JNE/JNZ - jump if not equal/zero
JEQ/JZ - jump if equal/zero
JNC/JLO - jump if no carry/lower
JC/JHS - jump if carry/higher or same
JN - jump if negative
JGE - jump if greater or equal
JL - jump if less
JMP - jump unconditionally


1. Instruction Encoding


An instruction in MSP430 is encoded in 16 bits. The
fixed instruction length facilitates instruction
decoding and design. This is probably one of the
reasons that MSP430 is placed in the RISC camp. The
encoded information includes the opcode, the source
register, the addressing modes for the source and
destination operands, the operand width (byte or
word), and the destination register. The opcode
specifies which operation to perform. Both source
and destination addressing modes may be specified.
Figure 6 shows the instruction encoding in the
MSP430 architecture. As can be seen, the highest 4
bits specify the opcode, followed by 4 bits for the
source register, one bit for destination addressing
(da), one bit for byte or word operands (b/w), two
bits for source addressing (sa), and the lowest 4
bits for the destination register.


Bits:    15-12    11-8    7     6      5-4    3-0
Field:   opcode   src     da    b/w    sa     dst

Figure 6 Instruction Encoding in MSP 430
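
The field layout of Figure 6 can be checked with a few shifts and masks. The following Python sketch is only an illustration: the function name decode_format1 and the dictionary keys are names chosen here, and the sample word assumes that MOV is opcode 4, the first of the type 1 opcodes.

```python
def decode_format1(word):
    """Split a 16-bit two-operand (type 1) instruction word into fields."""
    return {
        "opcode": (word >> 12) & 0xF,  # bits 15-12
        "src":    (word >> 8) & 0xF,   # bits 11-8, source register
        "da":     (word >> 7) & 0x1,   # bit 7, destination addressing
        "bw":     (word >> 6) & 0x1,   # bit 6, 0 = word, 1 = byte
        "sa":     (word >> 4) & 0x3,   # bits 5-4, source addressing
        "dst":    word & 0xF,          # bits 3-0, destination register
    }


# Assumed encoding of MOV.W R4, R5: opcode 4, src 4, da 0, b/w 0, sa 00, dst 5
fields = decode_format1(0x4405)
print(fields)
```

Changing only the b/w and sa fields of the same word (0x4465) would describe a byte move with a register indirect source, which shows how little of the word changes between variants of one instruction.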


Most of the MSP 430 instructions have two versions,
working on either byte operands or word operands.
This operand width information is kept in the b/w
bit of the instruction. In assembly programming, a
postfix “.W” indicates a word operation, whereas
“.B” indicates a byte operation. There are two bits
for source addressing but only one bit for
destination addressing. Therefore, in this design, the
source addressing is much more versatile than the
destination addressing. We will revisit addressing
modes later.


Theoretically, any assignment of opcodes would work.
However, a systematic arrangement makes instruction
decoding easier. Opcode 1 indicates type 2
instructions, opcodes 2 and 3 are for jump
instructions, and opcodes 4 to 15 are for type 1
instructions. Since the 7 type 2 instructions all
share opcode 1, they are told apart by three further
bits: the 7th, 8th, and 9th bits.
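
The decoding rule above can be sketched in Python as follows. This is a minimal illustration, not a full decoder: the function name classify is chosen here, and the table TYPE2_NAMES assumes that the 3-bit codes in bits 9 to 7 are numbered in the same order in which the type 2 instructions are listed earlier.

```python
# Assumed order of the type 2 sub-opcodes held in bits 9-7.
TYPE2_NAMES = ["RRC", "SWPB", "RRA", "SXT", "PUSH", "CALL", "RETI"]


def classify(word):
    """Return the instruction class selected by the top 4 bits of a word."""
    opcode = (word >> 12) & 0xF
    if opcode == 1:
        sub = (word >> 7) & 0x7          # bits 9-7 pick the type 2 instruction
        if sub < len(TYPE2_NAMES):
            return "type 2: " + TYPE2_NAMES[sub]
        return "unknown"
    if opcode in (2, 3):
        return "jump"
    if opcode >= 4:
        return "type 1"
    return "unknown"


for w in (0x1204, 0x1300, 0x3FFF, 0x4405):
    print(hex(w), classify(w))
```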


1. The MOV instructions 


The most widely used data movement instruction is the
MOV instruction. MOV does not actually move data
from one place to another. Instead, it copies the
data and leaves the source operand unchanged. The
syntax of the MOV instruction in MSP430 is as follows:


MOV src, dst


where src indicates the source operand, and dst
is the destination operand. Note that the order of
the operands depends on the design of an assembler.
In some assemblers, src and dst may be swapped.
The following statement will move (copy) data
stored in R4 to R5.


MOV R4, R5


If R4 contains 100, R5 will have the value 100 after
the statement is executed, and R4 still holds 100.
The value stored in R4 is not removed after the
MOV instruction is executed. The MOV instruction
in MSP430 may transfer data from memory to
memory, from memory to register, from register to
memory, and from register to register. Since most
RISC CPUs do not allow data to be transferred from
memory to memory in a single instruction, MSP430
is not pure RISC in this regard.


1. Addressing Mode 


Addressing modes are the ways the CPU gets operands
for an instruction. MSP 430 supports a wide range of
addressing modes including indexed, symbolic (PC
relative), absolute (&), indirect register (@), indirect
autoincrement (@+), and immediate (#).
In the encoding there are 8 combinations of
addressing modes in MSP 430, because two bits encode
the source addressing mode and one bit encodes the
destination addressing mode. Bits 4 and 5 indicate
source addressing, and bit 7 indicates destination
addressing.


There are four basic addressing modes for the source
operand. They are listed as follows (the leading
digits are bits 5 and 4 of an instruction):

* 00: register direct

* 01: indexed addressing

* 10: register indirect

* 11: indirect autoincrement

There are two addressing modes for the destination
operand, listed as follows (the leading digit is bit 7
of an instruction):

* 0: register direct

* 1: indexed addressing


1. Register Direct Addressing 


Register direct addressing specifies operands in
registers. In MSP 430, all 16 registers may be used
in register direct addressing mode. Let’s use the MOV
instruction in MSP 430 to illustrate this addressing.


MOV.W R4, R5 ; move the value stored in R4 to R5


Both the source (R4) and the destination (R5)
operands are registers. Therefore, both use the
register direct addressing mode. The above MOV.W
instruction moves one word of data stored in R4 to R5.
The “.W” postfix indicates a word operation. Register
direct addressing is the most basic addressing mode
and it is widely used in CPU design. If you
are not familiar with the addressing modes of a CPU,
you may find register direct addressing simple and
useful. Since this instruction is one word long, after
its execution, PC is increased by two. In general,
the increment of PC depends on the length
of the instruction.


R3 in MSP 430 is a constant generator. When the CPU
reads R3 in register direct mode, it gets zero. So
the following instruction is typically used to
initialize a register.


MOV.W R3, R5 ; initialize R5 to zero


The SR in MSP 430 can be either a source or a
destination operand. If SR is the source operand, its
content is read out in register direct mode. If
SR is the destination, its content is set to some
value.


What if you want to move only one byte of data? MSP
430 provides byte operand instructions, selected by
adding the postfix “.B” to the MOV instruction. The
following will move one byte of data from the low
byte of R4 to R5.


MOV.B R4, R5; move low byte of R4 to R5 


The byte data is taken from the low byte of R4 and
written to the low byte of R5. What happens to the
high byte of R5? For example, we set specific
values in R4 and R5 as follows.


MOV #0x1234, R4 ; set R4 to the hex 1234
MOV #0xabcd, R5 ; set R5 to the hex abcd
MOV.B R4, R5 ; move low byte of R4 to R5


After the above statements are executed, R4 will
keep 0x1234. Will R5 hold 0xab34, 0x1234, or
0x0034? R5 would not have 0x1234 because only one
byte is written to it. However, will R5 have the
value 0xab34? The answer is no. R5 will have the
value 0x0034. We know the value 34 is from R4, but
why does the high byte of R5 become 00? The reason is
that register R5 is 16 bits wide and every write to it
is a 16-bit write. Even though there is only one byte
of data from R4, that byte is joined with a zero high
byte and the resulting word is sent to R5. Therefore,
R5 ends up with 0 in the high byte.
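
The effect can be modelled in a couple of lines of Python. This is a sketch of the behaviour described above, not of the hardware; the function name mov_b_to_register is chosen here for illustration.

```python
def mov_b_to_register(src, dst):
    """Model MOV.B src, dst with a register destination.

    Only the low byte of src is kept; the high byte of the result is zero,
    so the previous content of dst plays no role.
    """
    return src & 0x00FF


r4, r5 = 0x1234, 0xABCD
r5 = mov_b_to_register(r4, r5)
print(hex(r4), hex(r5))
```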


1. Index Addressing 


Index addressing describes an operand in memory
by using a base address with a register that stores an
index, which has the following format:


Base(R)


where the resultant address is calculated by
$\mathrm{Base} + R$. In MSP430, both source and
destination operands are eligible for index
addressing. The Base is stored in the word right
after the instruction. Both source and destination
operands may have index addressing bases, which
causes two extra word fetches. Therefore, by and
large, index addressing is slower than register
direct or register indirect addressing. However, it
is very convenient for array accesses, as is
illustrated in the following example:


Table 1 An Example that Initializes an Array Using 
Index Addressing 


#include "msp430.h"                      ; #define controlled include file
        NAME    main                     ; module name
        PUBLIC  main                     ; make the main label visible
                                         ; outside this module
        ORG     0FFFEh
        DC16    init                     ; set reset vector to 'init' label
        RSEG    CSTACK                   ; pre-declaration of segment
        RSEG    DATA16_N                 ; begin data segment
A:      DS8     20                       ; reserve 20 bytes for array A
        RSEG    CODE                     ; place program in 'CODE' segment
init:   MOV     #SFE(CSTACK), SP         ; set up stack
main:   NOP                              ; main program
        MOV.W   #WDTPW+WDTHOLD, &WDTCTL  ; stop watchdog timer
        MOV.W   #20, R4                  ; initialize R4 to 20
start:  TST     R4                       ; compare R4 to 0
        JZ      done                     ; end loop if R4 is 0
        DEC     R4                       ; decrease R4 by 1 first, so the
                                         ; index stays within 0..19
        MOV.B   R4, A(R4)                ; set A[R4] to R4
        JMP     start                    ; goto start
done:                                    ; end of loop
        JMP     $                        ; jump to current location '$'
                                         ; (endless loop)
        END


Note that the MSP430 data segment starts at
the address 0x0200 and ends at 0x0BFF at most. The
DATA16_N segment is defined in the linker file for a
device, e.g., lnk430F2013.xcl defines all linker
information for the device MSP430F2013. It starts a
data segment without initialization. The upper limit
depends on the SRAM size. For example, the
MSP430F2013 has 128 Bytes of SRAM. Therefore,
its data memory ranges from 0x0200 to 0x027F. In
the above example, the assembler directive DS8 is
used to reserve a memory space of 20 Bytes.
Should another variable be defined right after the
array A, it will start at address 0x0214. The example
also demonstrates the use of a loop, a typical
programming structure in any kind of
programming language.
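
As a rough model of index addressing, the Python sketch below computes the effective address Base + R for each access and fills a 20-byte array. The start address 0x0200 for A, the dictionary used as memory, and the helper name indexed_address are assumptions made for the illustration.

```python
A = 0x0200            # assumed address of array A (start of the data segment)
memory = {}           # byte address -> byte value


def indexed_address(base, reg):
    """Effective address of the operand Base(R): base plus register."""
    return (base + reg) & 0xFFFF


r4 = 20
while r4 != 0:                                      # TST R4 / JZ done
    r4 -= 1                                         # DEC R4
    memory[indexed_address(A, r4)] = r4 & 0xFF      # MOV.B R4, A(R4)

next_free = A + 20    # first address after the 20 reserved bytes
print(hex(next_free), len(memory))
```

The last line reproduces the address quoted above for a variable placed right after the array.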


1. Register Indirect Addressing 


Register indirect addressing is the first choice if the 
operands are coming from memory because it is 
much faster than index addressing. It uses the 
following syntax to designate a memory location for 
an operand. 


@R 


Before register indirect addressing is used, a memory
address has to be stored in the register. Typically a
MOV instruction will take care of it. For example,
the following statements will load the memory address
0x0200 into R4, and copy the value at that location to
R5:


MOV #200h, R4 
MOV @R4, R5 


Note that register indirect addressing only applies to
the source operand in MSP430. The destination operand
may not use register indirect addressing because only
one bit encodes its addressing mode. What if the
destination operand needs register indirect
addressing? A workaround is to use index addressing
with the base set to 0 as follows:


MOV R5, 0(R4)


This will achieve the effect of register indirect on
R4. However, the base 0 still requires another
word fetch. This is the reason why index
addressing is slower than register indirect
addressing. The following example shows how to
quickly add all elements of an array using register
indirect addressing.


Table 2 Adding Elements of an Array Using Register 
Indirect Addressing 


#include "msp430.h"                      ; #define controlled include file
        NAME    main                     ; module name
        PUBLIC  main                     ; make the main label visible
                                         ; outside this module
        ORG     0FFFEh
        DC16    init                     ; set reset vector to 'init' label
        RSEG    CSTACK                   ; pre-declaration of segment
        RSEG    DATA16_N                 ; begin data segment
A:      DS8     20                       ; reserve 20 bytes for array A
        RSEG    CODE                     ; place program in 'CODE' segment
init:   MOV     #SFE(CSTACK), SP         ; set up stack
main:   NOP                              ; main program
        MOV.W   #WDTPW+WDTHOLD, &WDTCTL  ; stop watchdog timer
        MOV.W   R3, R5                   ; initialize R5 (the sum) to 0
        MOV.W   #A, R4                   ; initialize R4 to A's address
start:  CMP     #A+20, R4                ; check if R4 has passed the array
        JZ      done                     ; end loop if R4 reaches A+20
        MOV.B   @R4, R6                  ; set R6 to memory[R4]
        ADD     R6, R5                   ; add the element to the sum
        INC     R4                       ; increase R4 by 1
        JMP     start                    ; goto start
done:                                    ; end of loop
        JMP     $                        ; jump to current location '$'
                                         ; (endless loop)
        END


In the above example, the beginning address of
the array A has to be stored in the
register R4. The array elements are accessed by
register indirect on R4, so R4 has to be increased by
1 for each array element. The tricky part is how the
loop is terminated when the last element has been read
and processed. Here a comparison instruction is used
to check whether R4 has moved past the last element.
The address one past the last element is A+20. Thus,
the CMP instruction compares #A+20 to R4.
Note that the # indicates an immediate value,
which is only available to the source operand. So
swapping the operands of the CMP instruction is
semantically correct but syntactically incorrect!
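
The pointer-walking loop can be mirrored in Python to see how the end test works. This is a sketch under assumptions: A is placed at 0x0200, the array is filled with the sample values 1 to 20 (the listing itself does not initialize it), and r6 is a temporary name for the element just read.

```python
A = 0x0200                                    # assumed address of array A
memory = {A + i: i + 1 for i in range(20)}    # assumed sample contents 1..20

r5 = 0                          # running sum
r4 = A                          # MOV.W #A, R4
while r4 != A + 20:             # CMP #A+20, R4 / JZ done
    r6 = memory[r4]             # MOV.B @R4, R6  (register indirect read)
    r5 = (r5 + r6) & 0xFFFF     # ADD R6, R5     (16-bit register)
    r4 += 1                     # INC R4

print(r5, hex(r4))
```

The loop stops when R4 equals A+20, that is, one address past the last element, so all 20 bytes are read exactly once.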


1. Memory Layout 


MSP430 adopts the von Neumann architecture, which
employs a single memory space for both programs
and data. This is different from the Harvard
architecture, where program memory is separated
from data memory. Since the program and data
share the same bus in MSP430, instruction fetches
and operand fetches must take place one at a time.


All the peripherals and special function registers are
mapped to the single memory space. Memory
mapped I/O allows programs to access I/O ports as
if they were ordinary memory locations.


1. Programmer ’ s View 


Depending on the individual MSP430 device, the size
of SRAM and the size of flash vary, and thus the
memory map may change slightly. Shown below is
the memory map for MSP430.


Table 3 Memory Map for MSP430 
1. General Purpose Registers


There are 16 registers in MSP430.
Each register may hold a word of 16 bits (2 Bytes).
They are not assigned addresses in the memory
map, unlike in some other CPU designs. The first 4
registers (R0-R3) have dedicated uses whereas the rest
(R4-R15) are for general purposes. R0 is used as the
program counter with the alternative name PC. R1
is used as the stack pointer with the alternative name
SP. R2 is used for both the CPU status register (SR)
and a constant generator (CG1). R3 is used for a
constant generator (CG2). The registers may be
operated on as bytes, in which case the high bytes
are cleared.


1. Special Function Registers 


Special function registers in MSP430 are assigned
memory space from 0x0000 to 0x000F. They
include module enabling (ME1, ME2), interrupt
enabling (IE1, IE2), and interrupt flags (IFG1, IFG2).


1. Peripheral Registers with Byte Access 


Peripheral registers with byte access are assigned
memory addresses from 0x0010 to 0x00FF. They
include I/O port control registers such as P1IN,
P1OUT, P1DIR, P1IFG, P1IES, P1IE, P1SEL, and
P1REN, and basic system clock control registers such
as BCSCTL1, BCSCTL2, BCSCTL3, and DCOCTL. Each of
them is one byte wide, so the byte version of the
instructions should be used. The actual address
assigned to each of the above registers varies
slightly from one MSP430 device to another.


1. Peripheral Registers with Word Access 


Memory space ranging from 0x0100 to 0x01FF is
used for peripheral registers with word access. The
registers in this memory space include the watchdog
timer control register WDTCTL, and timer control
registers such as TACTL, TACCTL0, TACCTL1, TAR,
TACCR0, and TACCR1.


1. SRAM 


The memory space for SRAM starts from 0x0200 and
goes up to 0x0BFF. The upper bound depends on the
capacity of SRAM. For example, MSP430F2013 has
128 Bytes of SRAM. Therefore, its SRAM is assigned
the range 0x0200 to 0x027F. MSP430F2274 has
1024 Bytes of SRAM, and thus has an SRAM range
from 0x0200 to 0x05FF. This storage is normally used
for program variables. Its content is lost
when power is off.


1. Bootstrap Loader 


The bootstrap loader is a program that communicates
serially with the COM port of a PC to configure the
flash memory in early MSP430s. The bootstrap loader
was removed after F20XX for security concerns.


1. Information Flash 


The memory addresses from 0x1000 to 0x10FF are
reserved for a flash that stores non-volatile data.
The stored information may include a serial number
for the device, a MAC address for a network
device, or accounting information such as the
number of hours used. The information remains
in the flash memory even when the power is off.


1. Flash Code Memory 


Each time an MSP430 is configured, the flash
code memory is erased and loaded with the new
program. Once the programming is done, the code
of the program remains in the non-volatile flash
memory. The address space for the flash code
memory starts from 0xFFBF and grows downward to
0x1100, subject to the capacity of the flash in an
MSP430 device. For example, the flash code
memory size is 2 K Bytes in MSP430F2013. Its flash
code memory is then assigned the memory space
from 0xF800 to 0xFFFF (including the 64 bytes of
interrupt vectors).
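
The sizes quoted for these devices follow directly from the address ranges, since a range from first to last holds last - first + 1 bytes. The short Python check below uses only the ranges given in the text; the function and variable names are chosen here for illustration.

```python
def region_size(first, last):
    """Number of bytes in the inclusive address range first..last."""
    return last - first + 1


sram_f2013 = region_size(0x0200, 0x027F)    # SRAM of MSP430F2013
sram_f2274 = region_size(0x0200, 0x05FF)    # SRAM of MSP430F2274
flash_f2013 = region_size(0xF800, 0xFFFF)   # flash incl. interrupt vectors

print(sram_f2013, sram_f2274, flash_f2013)
```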


1. Interrupt and Reset Vectors 


The interrupt and reset vectors handle interrupt
requests and system reset. One interrupt vector
requires one word, so there are 32 vectors in
total. The assigned address space is from 0xFFC0
to 0xFFFF. Each of the vectors stores the starting
address of the corresponding interrupt service
routine (ISR). For example, when there is an I/O
interrupt request such as I/O data ready, the CPU
will look up the corresponding interrupt vector and
serve the I/O request by running its ISR. Interrupts
provide high performance I/O, and are the
fundamental mechanism for a multiprocessing
system.
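
The count of 32 follows from the size of the vector area divided by the size of one vector. The Python sketch below also computes the address of each vector; the numbering of vectors from 0 at 0xFFC0 is an assumption made for the illustration. The last slot lands on 0xFFFE, which is the address the program listings in this text use for the reset vector.

```python
VECTOR_FIRST = 0xFFC0
VECTOR_LAST = 0xFFFF
WORD = 2                                    # one vector occupies one word

num_vectors = (VECTOR_LAST - VECTOR_FIRST + 1) // WORD


def vector_address(index):
    """Address of the vector with the given (assumed) index, 0 at 0xFFC0."""
    return VECTOR_FIRST + index * WORD


print(num_vectors, hex(vector_address(num_vectors - 1)))
```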


1. Exercise 
What is the maximal size of SRAM and flash code
memory based on Table 3? Note that the bootstrap
loader space may not be used for either SRAM or
flash.


Variables and Data Structures 


1. Overview 


Any program is created to process data. In this
regard, a program may be thought of as a processor
that receives data, processes data, and outputs data.
A program is basically a set of instructions, which
are stored in the program memory. In MSP430, the
program code flash is the memory space that stores
the program. Data are stored in the SRAM memory
space. To process data, a program has to access the
data via their addresses. However, it is tedious to
use addresses to refer to the data. Instead, an alias
name like x, y, or z is normally used as a variable. A
variable is defined by associating a piece of memory
in the SRAM space with its name. By
using variables, access to the data area becomes
organized and straightforward.


Variables refer to their associated memory.
In most applications, the way the data are organized
is also important. For example, if there are 10
student records to be stored, an array of 10 student
records may be sufficient. However, if the number of
students is unknown, a linked list may be more
appropriate.


1. Variable Declaration 

Variable declarations in assembly programming
concern how much space is required but disregard
types. The declaration is the same whether the
variable is an integer, a Boolean, or a float.
Assembly programmers have to differentiate
the types of variables themselves. One way to
attach a type to a variable is naming. For example,
variable names that start with the letter “i” are
integers, “f” are floats, and “b” are Booleans. The
names are actually labels in assembly, and they are
case sensitive by default in the IAR system. That is,
“ia” is different from “iA.” Also, the labels are
placed in the first column of the assembly source
file. Labels may be followed by an optional colon by
convention. The following example defines three
variables, ix, iy, and iz, each of which is one byte.


Table 4 Variable Declaration and Initialization 


#include "msp430.h"                      ; #define controlled include file
        NAME    main                     ; module name
        PUBLIC  main                     ; make the main label visible
                                         ; outside this module
        ORG     0FFFEh
        DC16    init                     ; set reset vector to 'init' label
        RSEG    CSTACK                   ; pre-declaration of segment
        RSEG    DATA16_N                 ; start uninitialized data segment
ix      DB      0                        ; define variable ix and set to 0
iy:     DB      0                        ; define variable iy and set to 0
iz      DB      0                        ; define variable iz and set to 0
        RSEG    CODE                     ; place program in 'CODE' segment
init:   MOV     #SFE(CSTACK), SP         ; set up stack
main:   NOP                              ; main program
        MOV.W   #WDTPW+WDTHOLD, &WDTCTL  ; stop watchdog timer
        MOV.B   #12h, ix                 ; set ix to 12h
        MOV.B   #27h, iy                 ; set iy to 27h
        ADD.B   ix, iz                   ; add ix to iz
        ADD.B   iy, iz                   ; add iy to iz
        JMP     $                        ; jump to current location '$'
                                         ; (endless loop)
        END


The most common way to define variables in
assembly programs is using the assembler directive
DB, meaning define byte. Assembler directives are
not machine instructions, but they tell the assembler
to do something. The DB directive tells the
assembler to reserve one byte of space. The label
associated with the space is simply another way
to represent the address of that memory location.


1. Variable Initialization 


After a variable is defined, it must be initialized to
some known value. Otherwise, a computation that
involves it may not be predictable! The DB directive
is followed by an initial value. In the above
example, all three variables are declared and
initialized to zero.


ix      DB      0       ; define variable ix and set to 0
iy:     DB      0       ; define variable iy and set to 0
iz      DB      0       ; define variable iz and set to 0

Note that the DS8 (allocate space for integ...


In the example shown in Table 4, the variab...

        MOV.B   #12h, ix        ; set ix to 12h
        MOV.B   #27h, iy        ; set iy to 27h


1. Access Scalar Variables 


When variables are defined, each of them is
associated with a label (address). This label is used
by an assembly statement to reference the
designated variable. A variable that keeps only one
value is called a scalar variable. A scalar variable
may require several bytes, subject to the size of the
variable. For example, an integer in C is normally 4
bytes whereas a short is only 2 bytes. A double in
the IEEE 754 standard requires 8 bytes.


In MSP430, a variable may be referenced in
absolute addressing or symbolic addressing. Absolute
addressing is actually indexed on SR, which always
reads as zero in this mode, in the format
absolute_address(SR). An ampersand (&) is prefixed to
a label in assembly to indicate absolute addressing.
The following statement references variable ix in
absolute addressing mode.


MOV.B #12h, &ix ; set ix to 12h 


Symbolic addressing is actually indexed on PC, and
therefore it is PC relative addressing. It has the
format offset(PC), where offset is the displacement
between the current PC and the address of a
variable. The following statement references
variable ix in symbolic addressing mode.


MOV.B #12h, ix ; set ix to 12h


From the programming perspective, the two
addressing modes yield the same result, but there are
potential differences. First, symbolic addressing
allows code to be loaded at any location in
memory because the variable address is relative to
PC. Code that uses absolute addressing always refers
to one fixed location. Second, in some CPUs,
absolute addressing requires a fetch for an
address word whereas the offset in symbolic
addressing is relatively small and may be squeezed
into the instruction. This means symbolic addressing
may perform better. However, this is not the
case in MSP430, because both symbolic addressing
and absolute addressing require a fetch for the
address (offset), for the reason that they are indexed
addressing in nature.
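
The difference between the two modes can be made concrete with a small Python model of the address calculation. It is a sketch under assumptions: the addresses 0x0200 and 0xF810 are made up, and pc stands for whatever PC value the CPU uses when it adds the offset.

```python
def absolute_ea(address):
    """address(SR): SR reads as 0 in this mode, so the address is used as is."""
    return (0 + address) & 0xFFFF


def symbolic_ea(pc, offset):
    """offset(PC): the operand address is PC plus the stored offset."""
    return (pc + offset) & 0xFFFF


ix = 0x0200                      # assumed address of variable ix
pc = 0xF810                      # hypothetical PC value when the offset is used
offset = (ix - pc) & 0xFFFF      # displacement worked out by the assembler

same_target = symbolic_ea(pc, offset) == absolute_ea(ix)
moved_target = symbolic_ea(pc + 0x0100, offset)   # code loaded 0x100 higher
print(same_target, hex(moved_target))
```

If the code is loaded 0x100 bytes higher, the symbolic operand follows it by the same amount, which is what makes the code relocatable as long as code and data move together. The absolute operand stays at the same address wherever the code is placed.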


1. Data Types 


IAR supports constants for integers, characters, and
floating point numbers. This eases constant
definitions for the three data types. It is worth
mentioning that the actual arithmetic operating upon
each of the three data types still has to be
programmed accordingly.


1. Integer 


Integers in most programming languages are 4 bytes
in length. In 32-bit CPUs, the ADD instruction will
add two integers together. However, in CPUs with
operands narrower than 32 bits, adding two 4-byte
integers requires an algorithm implemented in a
subroutine or a macro. For example, adding two
integers (4 bytes) cannot be done by a single ADD
instruction in MSP430 because the word size is 2
bytes, and doing so would just add half of them! IAR
implements 4-byte two’s complement integers. If
there is not enough space, only the low bytes are
used. Negative numbers are designated by a leading
minus (-) sign. The leading or trailing data
representation code may be either uppercase or
lowercase.


Table 5 Integer Constants 


Integer type    Example

Binary          1010b, b'1010'

Octal           1234q, q'1234'

Decimal         1234, -1, d'1234'

Hexadecimal     0xFFFF, 0FFFFh, h'FFFF'


The following example defines three 4-byte integers
and an integer addition for summing two integers.
The integers are defined as 32-bit constants using
hexadecimal notation. The values (0x12335678
and 0x8765B321) are selected to ease verification.
Note that there is a carry from the low word addition,
which has to be added to the high word addition.
The subroutine (ADDI) is defined with the
assumption that R13, R14, and R15 hold the
addresses of the first integer, the second integer,
and the result, respectively. Note also that the pound
(#) sign in front of the variable labels in the MOV
instructions designates the address. Without the pound
sign, these statements would assign the
corresponding register the value stored in the
variable.


Table 6 An Example that Defines 4-byte Integers and 
Their Addition 


#include "msp430.h"                      ; #define controlled include file
        NAME    main                     ; module name
        PUBLIC  main                     ; make the main label visible
                                         ; outside this module
        ORG     0FFFEh
        DC16    init                     ; set reset vector to 'init' label
        RSEG    CSTACK                   ; pre-declaration of segment
        RSEG    DATA16_N                 ; start uninitialized data segment
ix      DC32    0x12335678               ; int ix = 0x12335678
iy      DC32    0x8765B321               ; int iy = 0x8765B321
iz      DC32    0                        ; int iz = 0
        RSEG    CODE                     ; place program in 'CODE' segment
init:   MOV     #SFE(CSTACK), SP         ; set up stack
main:   NOP                              ; main program
        MOV.W   #WDTPW+WDTHOLD, &WDTCTL  ; stop watchdog timer
        MOV     #ix, R13                 ; store ix's address in R13
        MOV     #iy, R14                 ; store iy's address in R14
        MOV     #iz, R15                 ; store iz's address in R15
        CALL    #ADDI                    ; call subroutine ADDI for integer
                                         ; addition
        JMP     $                        ; jump to current location '$'
                                         ; (endless loop)
ADDI:                                    ; integer addition @R13+@R14 -> @R15
                                         ; (the result must start at 0)
        ADD     0(R13), 0(R15)           ; put @R13's low word in @R15
        ADD     0(R14), 0(R15)           ; add low words
        ADDC    2(R13), 2(R15)           ; add @R13's high word with carry
        ADD     2(R14), 2(R15)           ; add high words
        RET
        END
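
The role of the carry can be checked with a small Python model that adds two 32-bit integers one 16-bit word at a time. It is a sketch of the arithmetic only, with names chosen here; it does not model registers or flags beyond the single carry bit.

```python
def add32_by_words(x, y):
    """Add two 32-bit values using 16-bit word additions plus a carry."""
    lo = (x & 0xFFFF) + (y & 0xFFFF)          # low word addition
    carry = lo >> 16                          # 1 if the low words overflowed
    hi = (x >> 16) + (y >> 16) + carry        # high word addition with carry
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF), carry


iz, low_carry = add32_by_words(0x12335678, 0x8765B321)
print(hex(iz), low_carry)
```

With the two values used in this example the low words produce a carry, and the result is 0x99990999. Dropping the carry would give 0x99980999 instead.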


Integers of a variety of lengths may be defined using
the following assembler directives. The DC directives
define an integer with an initial value whereas the DS
directives just reserve space without an initial value.


Table 7 Assembly Directives for Integer Definitions 


Size of Integers    Definition with Initial Values    Definition without Initial Values


1. Characters


Characters form strings and messages that may be
displayed on an output device such as an LCD panel.
For example, if the machine is waiting for the user’s
input, a prompt such as “Please input your choice:”
should be displayed. An error message such as “not
enough deposit” should be displayed by a vending
machine that requires more coins to release a cereal
bar. Without those messages, it would be really hard
to know what to do next when operating a machine.


By and large, DC8/DB is used to define a string 
constant. The good thing about it is that there is no 
size argument. The assembler will automatically 
calculate the size of a string and allocate space for 
it. The following statements define the 
aforementioned strings. 


Prompt  DB      "Please input your choice:"
ErrMsg  DC8     'Not enough deposit!'


The use of DB and DC8 is identical. In the first
statement, a pair of double quotes is used for the
string “Please input your choice:” The assembler
will allocate one byte for each character in the
string, plus one extra byte for the null character at
the end. Overall, the allocated space is the
string length plus one. This is called a
null-terminated string. Many languages such as C use
null-terminated strings. The second statement above
uses a pair of single quotes for the string, which
results in exactly as many bytes as there are
characters. There is no null character at the end.
Non-null-terminated strings must have their
sizes kept somewhere to be operated on correctly. In
some situations, all non-null-terminated strings are
designed to have the same length. Thus, the size
information does not waste too much space.
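
The space taken by the two definitions can be counted with Python byte strings. This is only a model of the layout described above; the variable names are chosen here.

```python
# Double-quoted definition: one byte per character plus a terminating null.
prompt = b"Please input your choice:" + b"\x00"

# Single-quoted definition: exactly one byte per character, no terminator.
err_msg = b"Not enough deposit!"

print(len(prompt), len(err_msg))
```

The first string has 25 characters and occupies 26 bytes. The second has 19 characters and occupies 19 bytes, so a program that uses it has to know that length from somewhere else.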


Table 8 Examples of Character String Definitions 
Table 8 lists some string definitions. Two single
quotes within a string are used to represent one
single quote, which is part of the string. A double
quote itself in a string is represented by a
backslash followed by a double quote. Two
backslashes are used to indicate a backslash in a
string. Note that an empty string (nothing) is
represented by a pair of single quotes. A pair of
double quotes with nothing in-between indicates a
null-terminated string, which contains just the null
character.


1. Floating Points 


Applications that work with floating point numbers
create a need for floating point declarations. IAR
provides two assembler directives for floating point
definitions, namely DF32 (DF) and DF64. DF32 defines a
single precision floating point number whereas DF64
defines a double precision floating point number. A
single precision floating point number occupies 4
bytes in memory, and a double precision floating point
number occupies 8 bytes in memory. The syntax for a
floating point number is as follows:


[+|-] [digits] [.digits] [{E|e}[+|-]digits]


The square brackets mean optional and the vertical
bar is used to separate options. Curly braces are
used for a set of options. Based on the above
floating point syntax, the following are legal floating
point numbers.


3.14

6.02E+23
-1.602176565e-19
.1234


The above numbers are interpreted as decimal
numbers, meaning that the base of the exponent is
10. So the number 6.02E+23 is $6.02 \times 10^{23}$.
The assembler will convert them to IEEE 754 floating
point formats, and allocate 8 bytes for double
precision and 4 bytes for single precision. The
following statement defines a single precision
floating point value 8.0.


f       DF32    8.0


Since this is a single precision floating point
number, the assembler converts 8.0 to 0x41000000, and
stores it in memory. The following shows the IEEE 754
single precision floating point format.


Table 9 IEEE 754 Single Precision Representation 
for the Number 8.0 
In Table 9, the exponent is $130 - 127 = 3$, and the
mantissa is 0, meaning the fraction of the normalized
number is 0. With the leading 1, the number is
$1.0 \times 2^{3} = 8$.
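
The bit pattern can be verified on any machine with IEEE 754 support. The Python sketch below uses the standard struct module to obtain the single precision encoding of 8.0 and then splits it into sign, exponent, and mantissa; the variable names are chosen here.

```python
import struct

raw = struct.pack(">f", 8.0)              # big-endian single precision bytes
bits = int.from_bytes(raw, "big")

sign = bits >> 31                         # 1 bit
exponent = (bits >> 23) & 0xFF            # 8 bits, biased by 127
mantissa = bits & 0x7FFFFF                # 23 bits, fraction after leading 1

value = (-1) ** sign * (1 + mantissa / 2 ** 23) * 2 ** (exponent - 127)
print(hex(bits), sign, exponent, mantissa, value)
```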


1. User-Defined Data Types 


Most assemblers only support a handful of
types, such as the integers, characters, and floating
point numbers in IAR. “Support” does not mean that the
CPU actually supports arithmetic directly on those
data types. In IAR, the supported data types give
assembly programmers the ability to define
variables with initial values of those data types. The
programmers have to design algorithms in
assembly to operate upon those data. Similarly,
programmers may define variables of virtually any
data type, say, an array of 10 words. Basically, the
programmers only need to let the assembler know
the size of the variable (or object). The assembler
will reserve that amount of space for the variable. As
to the operations on the variable, as with the
supported data types, it is up to the programmers
to design suitable algorithms for it.


1. Pointer Data Type 


Pointers are basically variables that keep addresses
instead of values. Variables work like
temporary storage that keeps goods. If we need
something, we get it out from the storage. When we are
done using it, we may put it back into the storage. The
storage may keep actual items, or a note that
tells the user where to get the item. Likewise, if a
variable keeps the place where the actual value is
stored, this variable is of pointer data type.


The pointer data type is needed to implement data
structures such as linked lists, trees, etc. Data
structures that have a dynamic characteristic require
pointers. The dynamic characteristic allows
efficient memory use in the system. For example, in
situations where the number of student records is
unknown before a program starts, the linked list
data structure is more suitable than an array, which is
static. In embedded systems, where memory is
considered a precious resource, creating a big array
upfront is not feasible.


Table 10 lists source code for a simple dynamic
memory management system that implements a
first-fit algorithm using a linked list data structure.
The dynamic memory (a.k.a. heap) and the stack
share the same memory space. The stack grows
downward whereas dynamic memory grows
upward. We use a variable called brk that keeps the
watermark of the heap, i.e., the highest address in
the memory that has been used for the heap. The
program also keeps a variable for the beginning of
the heap, named bgn.


The heap for this application is divided into
two-word objects for linked list nodes. Each node has
two fields: data and next, each of which is one word
in length. The data field stores a value. The next
field is a pointer keeping the address of the next
node in the linked list.


Table 10 A Simple First-Fit Dynamic Memory 
Management System Using a Linked List Data 
Structure 


#include "msp430.h"                      ; #define controlled include file
        NAME    main                     ; module name
        PUBLIC  main                     ; make the main label visible
                                         ; outside this module
        ORG     0FFFEh
        DC16    init                     ; set reset vector to 'init' label
        RSEG    CSTACK                   ; pre-declaration of segment
        RSEG    DATA16_N
bgn     DW      $+4                      ; keep heap start address
brk     DW      $+2                      ; keep heap watermark
        RSEG    CODE                     ; place program in 'CODE' segment
init:   MOV     #SFE(CSTACK), SP         ; set up stack
main:   NOP                              ; main program
        MOV.W   #WDTPW+WDTHOLD, &WDTCTL  ; stop watchdog timer

        CALL    #alloc                   ; create a head node, 3
        MOV     R15, R4                  ; set R4 to address
        MOV     #3, R5                   ; set R5 to value
        CALL    #setNode                 ; setNode sets fields
        MOV     R4, R6                   ; R6 keeps the head

        CALL    #alloc                   ; create a node, 5
        MOV     R15, 2(R4)               ; link to head
        MOV     R15, R4                  ; set R4 to address
        MOV     #5, R5                   ; set R5 to value
        CALL    #setNode                 ; setNode sets fields

        CALL    #alloc                   ; create a node, 7
        MOV     R15, 2(R4)               ; link to previous node
        MOV     R15, R4                  ; set R4 to address
        MOV     #7, R5                   ; set R5 to value
        CALL    #setNode                 ; setNode sets fields

        JMP     $                        ; jump to current location '$'
                                         ; (endless loop)

setNode:                                 ; subroutine to set linked list node
                                         ; (data, next)
        PUSH    R4                       ; R4 keeps address
        MOV     R5, 0(R4)                ; set data field in the node
        ADD     @R3, R4                  ; R4 + 2 -> R4 (next)
        MOV     R3, 0(R4)                ; void next field
        POP     R4                       ; restore R4
        RET

alloc:                                   ; first-fit, returns free node at R15
        PUSH    R12                      ; temp register
        PUSH    R13                      ; temp register
        PUSH    R14                      ; temp register
        MOV     bgn, R13                 ; R13 keeps heap start address
        MOV     brk, R14                 ; R14 keeps heap watermark
L1:
        CMP     R14, R13                 ; check if watermark is reached
        JZ      L2                       ; jump to raise watermark
        ADD     @R3, R13                 ; R13 + 2 -> R13
        MOV     @R13+, R12               ; get the pointer of the node
        AND     0(R3), R12               ; R12 & 1, a free node has bit 0
                                         ; set in its pointer
        JZ      L1                       ; non-free node must not have
                                         ; the bit set
        ADD     #-4, R13                 ; back to node address
        JMP     L3                       ; found free node
L2:     ADD     @SR, brk                 ; brk + 4 -> brk
L3:     MOV     R13, R15                 ; set free address
        POP     R14                      ; restore R14
        POP     R13                      ; restore R13
        POP     R12                      ; restore R12
        RET
        END


Two subroutines are implemented: setNode and alloc.
The setNode subroutine receives two arguments, the
node address and the value for the data, in R4 and
R5, respectively. It first backs up R4 on the stack
because R4 will be modified in the subroutine and the
caller may need it later. So the best way is to push
it on the stack and restore it before returning to
the caller. This is a typical technique for reusing
registers in subroutines. The setNode subroutine then
sets and initializes the fields in the node
accordingly.


The alloc subroutine implements a first-fit algorithm 
that searches the first free space starting from the 
beginning address to the watermark of the heap. 
Since the size of the linked list node is two words, 
i.e., 4 bytes, the zeroth bit and the first bit of the 
node addresses will always be zero. Therefore, they 
may be used for memory status bits. In this 
program, we assume the zeroth bit indicates if the 
node is free or occupied. A free node will have the 
zeroth bit set in its next field. The alloc subroutine 
reads the next field and checks if its zeroth bit got 
set. If yes, the free node is found and its address is 
adjusted for return. 


If all nodes are occupied, the watermark (brk) has to
be raised, i.e., the heap grows. In practice, the
heap may grow or shrink. Growing accommodates more
object allocations, and shrinking gives more space to
the stack. To keep the program simple, heap shrinking
is not implemented. However, readers should be able
to modify the program to support heap shrinking.


To keep the listing simple, the free subroutine is
not shown in this source code. However, it is simple
to write: freeing an object simply sets the zeroth
bit of the next field in a node. In practice, a
variable that keeps the total size of the free space
is also helpful. For example, the alloc subroutine
would not need to walk through a fully packed heap
before finally raising the watermark. Instead, if the
total free space is less than the amount requested,
the search is bypassed and the watermark is raised
directly. This greatly improves alloc performance.
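
The first-fit scheme described above, including the
free operation that the listing leaves out, can be
sketched in C as follows. This is only an illustrative
model, not the author's implementation: the heap is
simulated as an array of 16-bit words, node "addresses"
are even word indices, and the names heap_bgn,
heap_brk, node_alloc, set_node, node_free, HEAP_WORDS,
and NO_NODE are assumptions introduced here. Unlike
the assembly listing, node_alloc clears the free bit
itself when it reuses a node.

```c
#include <assert.h>
#include <stdint.h>

#define HEAP_WORDS 16u      /* hypothetical heap size in 16-bit words */
#define NODE_WORDS 2u       /* each node: data word + next word */
#define FREE_BIT   0x0001u  /* bit 0 of the next field marks a free node */
#define NO_NODE    0xFFFFu  /* hypothetical "heap exhausted" marker */

static uint16_t heap[HEAP_WORDS];
static uint16_t heap_bgn = 0;   /* index of the first heap word */
static uint16_t heap_brk = 0;   /* watermark: first word never handed out */

/* First fit: scan from heap_bgn to heap_brk, else raise the watermark. */
uint16_t node_alloc(void) {
    uint16_t n;
    for (n = heap_bgn; n < heap_brk; n = (uint16_t)(n + NODE_WORDS)) {
        if (heap[n + 1u] & FREE_BIT) {
            heap[n + 1u] = 0;           /* claim the node */
            return n;
        }
    }
    if (heap_brk + NODE_WORDS > HEAP_WORDS) {
        return NO_NODE;                 /* no room to grow */
    }
    n = heap_brk;
    heap_brk = (uint16_t)(heap_brk + NODE_WORDS);
    return n;
}

/* Store the data and void the next field, like setNode in the listing. */
void set_node(uint16_t n, uint16_t value) {
    assert(n + 1u < HEAP_WORDS);
    heap[n] = value;
    heap[n + 1u] = 0;
}

/* Freeing a node only sets the zeroth bit of its next field. */
void node_free(uint16_t n) {
    assert(n + 1u < HEAP_WORDS);
    heap[n + 1u] |= FREE_BIT;
}
```

Because node indices are always even in this model, bit
0 of a next field is never part of a valid link, which
mirrors the reasoning about the low address bits in
the text.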


1. Array Data Types 


An array is a group of items of the same size. A
particular element is accessed by adding its
displacement to the beginning address of the array.
Thus, the array is considered a fast access mechanism
for storing data with equal-sized records such as
matrices. For example, if we store 10 numbers in an
array of 10 words (two bytes each), the address of an
element i is calculated as follows:

$$\text{address}(a[i]) = \text{base} + i \times 2$$

The array indices run from 0 to 9, that is,
$0 \le i \le 9$. In CPUs like MSP430, multiplication
may not be implemented in hardware. The
multiplication in the address calculation may be
implemented by a left shift by one bit, which is
arithmetically the same as multiplying by 2.
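
As a small illustration of the address calculation
above, the following C sketch computes the byte
address of element i with a shift instead of a
multiplication. The function name and the sample base
address used in the usage note are assumptions made
for this example only.

```c
#include <stdint.h>

/* Byte address of element i in an array of two-byte words
 * that starts at byte address base. */
uint16_t word_element_address(uint16_t base, uint16_t i) {
    return (uint16_t)(base + (i << 1));   /* i << 1 equals i * 2 */
}
```

With a hypothetical base address of 0x0200, element 0
is at 0x0200 and element 9 is at 0x0212.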


The array shown in the above example is
one-dimensional. However, a multi-dimensional array
is often required, for example in a two-dimensional
matrix multiplication. Because memory is always
organized in one dimension, a two-dimensional array
has to be laid out in one-dimensional memory. The
following section discusses two approaches to
organizing two-dimensional arrays in one-dimensional
memory.


1. Row-Major Order versus Column-Major 
Order 


Arrays with more than one dimension are called
multi-dimensional arrays. To store multi-dimensional
arrays in memory, a mapping from multiple dimensions
to one dimension has to be in place. Basically, we
want to store multi-dimensional array elements in a
linear address space. The first approach, called
row-major, stores each row of an array in sequence.
For example, a two-dimensional array a[3][4] has the
following mapping in row-major order: row 0 (a[0][0]
to a[0][3]) is stored first, followed by row 1
(a[1][0] to a[1][3]) and then row 2 (a[2][0] to
a[2][3]).


Assume each element in the array is two bytes in
length. The twelve elements of the two-dimensional
array are mapped to a one-dimensional array from
a[0] to a[11]. What we want now is to compute the
address of a given array element a[i][j]. Since the
mapping is quite regular, the address calculation is
as follows.

$$\text{address}(a[i][j]) = \text{base} + (4i + j) \times 2$$


The calculation is based on the number of rows ahead
of this element plus the displacement of the element
from the beginning of the row. The number 4 is the
column dimension. In a three-dimensional array, the
address calculation in row-major is as follows.

$$\text{address}(a[i][j][k]) = \text{base} + ((i \cdot d_1 + j) \cdot d_2 + k) \times 2$$


Here $d_n$ is the number of elements in the $n$th
dimension, where $n$ starts from 0. For example, the
two-dimensional array a[3][4] has $d_0 = 3$ and
$d_1 = 4$.


Column-major, on the other hand, organizes array
elements column by column. In the column-major
organization for the array a[3][4], column 0
(a[0][0], a[1][0], a[2][0]) is stored first, followed
by columns 1, 2, and 3.


In column-major ordering, a[i][j] is mapped to
a[3j + i], whereas it is mapped to a[4i + j] in
row-major ordering. Obviously, the address mapping is
related to the row dimension, i.e., the number of
elements in a column. In the two-dimensional case,
the row and column indices are swapped in the address
calculation.

$$\text{address}(a[i][j]) = \text{base} + (3j + i) \times 2$$


The number 3 in the above address calculation is the
row dimension. In a three-dimensional array, the
address calculation in column-major is as follows.

$$\text{address}(a[i][j][k]) = \text{base} + ((k \cdot d_1 + j) \cdot d_0 + i) \times 2$$
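
The two-dimensional mappings can be checked with a
short C sketch. The function names and parameters
below are illustrative assumptions; the functions
return byte offsets from the beginning of the array
rather than absolute addresses.

```c
#include <stddef.h>

/* Byte offset of a[i][j] when rows are stored one after another. */
size_t row_major_offset(size_t i, size_t j, size_t cols, size_t elem_size) {
    return (i * cols + j) * elem_size;
}

/* Byte offset of a[i][j] when columns are stored one after another. */
size_t col_major_offset(size_t i, size_t j, size_t rows, size_t elem_size) {
    return (j * rows + i) * elem_size;
}
```

For the array a[3][4] with two-byte elements, a[1][2]
lies at byte offset 12 in row-major order and at byte
offset 14 in column-major order. C itself stores
multi-dimensional arrays in row-major order, so the
first function agrees with the compiler's own layout.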


Assembly Programming - Part 2 

This chapter introduces MSP 430 assembly 
programming language such as instruction formats, 
addressing modes, subroutine call and return 
mechanisms (cross-reference PL/Language 
Translation and Execution), I/O and interrupts, and 
Heap vs. Static vs. Stack vs. Code segments. Explain 
the organization of the classical von Neumann 
machine and its major functional units. Describe 
how an instruction is executed in a classical von 
Neumann machine, with extensions for threads, 
multiprocessor synchronization, and SIMD 
execution. Summarize how instructions are 
represented at both the machine level and in the 
context of a symbolic assembler. Demonstrate how 
to map between high-level language patterns into 
assembly/machine language notations. Explain 
different instruction formats, such as addresses per 
instruction and variable length vs. fixed length 
formats. Explain how subroutine calls are handled at 
the assembly level. Explain the basic concepts of 
interrupts and I/O operations. Write simple 
assembly language program segments. Show how 
fundamental high-level programming constructs are 
implemented at the machine-language level. 


Emulated Instruction 


Emulated instructions are aliases of instructions
with fixed operands. For example, incrementing a loop
variable by 1 is a common case in most applications.
So an emulated instruction, INC, is provided in IAR
to increase its operand by 1. The actual instruction
is the ADD instruction. The advantage of using
emulated instructions is that the program is more
readable, and sometimes the performance is better.


Emulated      Actual           Source addressing
INC R4        ADD #1, R4       immediate addressing


The emulated instruction acts like a
single-instruction macro. The difference is that a
macro is user-defined whereas an emulated instruction
is predefined by the system. The transcription
depends on the assembler implementation. The above
increment instruction may be transcribed to the
following better-performing instruction, which uses
the constant generator for one.


ADD 0(R3), R4      ; 0(R3) is the constant generator for 1


Note that the INC instruction is known to the
assembler (different assemblers may define different
emulated instructions), not to the CPU directly. The
assembler transcribes the emulated instruction to the
actual instruction. In the IAR system, there are 23
emulated instructions defined, as listed in Table 12.
The emulated instructions cover 1) carry related
operations (ADC, CLRC, DADC, SETC, SBC), 2) bit
operations in SR (CLRC, CLRN, CLRZ, DINT, EINT, SETC,
SETN, SETZ), 3) common increment/decrement (DEC,
DECD, INC, INCD), 4) stack operation (POP), and
5) others (BR, NOP, RET, RLA, RLC, TST).


Table 12 MSP430 Emulated Instructions 
Note that most of the constants such as 0, 1, 2, 4, 8, 
and -1, may be replaced with constant generators to 
get a better performance. The NOP operation would 
not change any values but consume a machine 
cycle. In cases where the CPU has to wait on some 
events, the NOP is filled to achieve this delay. 


Arithmetic and Logical Operations 


An expression is a combination of operators and
operands that may be evaluated to a single value.
For example, x + y + z + 1234 is an expression that
defines the sum of the four operands x, y, z, and
1234. Some operations such as multiplication and
division are not provided directly by MSP430's
instruction set. Those missing operations have to be
implemented in subroutines or macros. MSP430
instructions use the destination operand as the
second source operand. Therefore, the above
expression may be translated to the following machine
code.


        MOV R3, R4       ; initialize R4 to 0
        ADD x, R4        ; x + R4 -> R4
        ADD y, R4        ; y + R4 -> R4
        ADD z, R4        ; z + R4 -> R4
        ADD #1234, R4    ; 1234 + R4 -> R4


After the above statements are executed, the sum of 
the four numbers will be stored in the register R4. 


Assignments 


After an expression is evaluated, the value is stored
in a variable for future reference. Otherwise, it
makes no sense to compute the expression. As a matter
of fact, this simple assignment has been employed in
each ADD instruction, where the destination operand
always receives the sum. In the previous example, we
may assign the expression to a variable, say, v, as
follows:

v := x + y + z + 1234


The machine code for this assignment includes the
evaluation of the expression, and a MOV statement at
the end to store the result (R4) in the variable v.


        MOV R3, R4       ; initialize R4 to 0
        ADD x, R4        ; x + R4 -> R4
        ADD y, R4        ; y + R4 -> R4
        ADD z, R4        ; z + R4 -> R4
        ADD #1234, R4    ; 1234 + R4 -> R4
        MOV R4, v        ; store result in v


Commutative Operators 


Operations like addition, logical AND, and logical
OR are commutative, meaning that the order of their
operands does not affect the result. For example,
x + y = y + x. Therefore, adding x first or y first
yields the same result. However, if x and y are in
registers, there is a question of where to put the
sum. If one of the values of x and y is needed for
future computation, that value should be placed as
the source operand. In the following example, R4
keeps the value of x after the addition. So R4 may
participate in later computation that needs the value
of x without reloading it from memory. Moving data
from memory unnecessarily not only results in worse
performance, but also increases code size.


        MOV x, R4        ; x -> R4
        MOV y, R5        ; y -> R5
        ADD R4, R5       ; x + y -> R5, R4 is intact


Logical Expressions 


A logical expression is a relational expression or a
Boolean expression that may be evaluated to either
true or false. A relational expression, such as a
comparison of y with z, defines the relation between
two values. However, there is no Boolean data type in
assembly. What we can do is use 0 and 1 to represent
false and true, respectively. In this way, we may
assign the result of a relational expression to a
Boolean variable x.


Here the assignment notation ":=" (as in x := 0) is
used to differentiate it from the equal sign "=",
which checks the equality of two numbers. There is no
direct instruction that translates a relation to a
Boolean value in assembly. Therefore, a comparison
(CMP) has to be used to check the relation first. The
comparison results are set in the SR register. A
control flow should then be implemented to assign
either 0 or 1 to the variable x. Below is the machine
code for such a Boolean assignment. It stores 1 in x
when the carry bit is set after the comparison, that
is, when z >= y as unsigned numbers, and 0 otherwise.


        MOV y, R4        ; y -> R4
        MOV z, R5        ; z -> R5
        CMP R4, R5       ; R5 - R4 sets the flags in SR
        JC  No           ; jump if the carry bit is set
Yes:    MOV R3, x        ; x := 0
        BR  #Done
No:     MOV 0(R3), x     ; x := 1
Done:


In the above code, the label "Yes" is not necessary,
but it makes the program more readable. A control
structure is inevitable because the relational
comparison stores its results in the SR register. A
Boolean expression such as y AND z may be implemented
using the bit-wise AND operation directly. The
following machine code implements the Boolean
assignment x := y AND z.


        MOV y, R4        ; y -> R4
        MOV z, R5        ; z -> R5
        AND R4, R5       ; R4 & R5 -> R5
        MOV R5, x        ; R5 -> x


Note that the bit-wise AND operation yields 1 when
both R4 and R5 hold the value 1. In systems that
assume 0 for false and non-zero for true, the
bit-wise AND operation may not work. For example,
0x2 and 0x1 are both logically true (non-zero), but
0x2 AND 0x1 = 0.
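
The pitfall can be made concrete with a brief C
sketch. The helper names normalize_bool and bool_and
are assumptions introduced for illustration; the idea
is simply to reduce each operand to 0 or 1 before
applying the bit-wise AND.

```c
#include <stdint.h>

/* Map any non-zero value to 1 and zero to 0. */
uint16_t normalize_bool(uint16_t v) {
    return (v != 0) ? 1u : 0u;
}

/* Logical AND built from the bit-wise AND on normalized operands. */
uint16_t bool_and(uint16_t y, uint16_t z) {
    return normalize_bool(y) & normalize_bool(z);
}
```

The raw bit-wise result of 0x2 and 0x1 is 0, whereas
bool_and treats both operands as true and yields 1.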


There is no explicit inclusive OR instruction in
MSP430, but the instruction BIS (bit set) may be used
as a substitute. The only difference is that SR is
not affected by the BIS instruction. Should a
following control structure depend on the computation
result, the BIS may be followed by a dummy ADD
instruction that sets the status bits. The following
example illustrates the Boolean assignment
x := y OR z.


        MOV y, R4        ; y -> R4
        MOV z, R5        ; z -> R5
        BIS R4, R5       ; R4 | R5 -> R5
        MOV R5, x        ; R5 -> x


Powers of Two Arithmetic 


Digital systems are based on binary arithmetic 
because the memory cell in hardware can only 
represent either high or low state. This creates some 
well-known relations between shifting and 
multiplication/division by two and its powers. The 
following section discusses these operations. 


Multiply by Two to the Powers of Two 


If we compare the powers of two, 1, 2, 4, 8, 16, with
their binary representations, 1, 10, 100, 1000,
10000, a left-shifting pattern emerges. Since numbers
are stored in binary format in machines, we may left
shift a value by one bit to achieve the same effect
as multiplying it by 2. If we left shift it again, we
actually multiply it by 4, which is $2^2$. It turns
out that multiplication by any power of 2 ($2^n$) can
be done in this way, by shifting left n times. The
following example illustrates a multiply-by-4
operation.


        MOV x, R4        ; x -> R4
        RLA R4           ; R4 * 2 -> R4
        RLA R4           ; R4 * 2 -> R4
        MOV R4, x        ; R4 -> x


It is worth mentioning that the RLA instruction is an
emulated one, which actually uses the ADD instruction
to add the operand to itself. MSP430 also provides
rotation through carry (RLC and RRC). The carry bit
needs to be cleared before a rotation if a zero is
expected to be shifted in.


Divide by Two to the Powers of Two 


Division by two or by a power of two may be
implemented using a right shift. For example, 4
(0100) right shifted by one bit becomes 2 (0010).
Note that a plain right shift is not identical to
division by two. A negative number in two's
complement format has its MSB set to 1; if a zero is
shifted into the MSB, the resultant value becomes
positive! For example, -7 in 4-bit two's complement
representation is 1001. If we right shift it by one
bit with a zero shifted in, it becomes 0100, which is
4. For this reason, MSP430 provides an arithmetic
right shift (RRA), which preserves the MSB. The
following example illustrates a divide-by-4
operation.


        MOV x, R4        ; x -> R4
        RRA R4           ; R4 / 2 -> R4
        RRA R4           ; R4 / 2 -> R4
        MOV R4, x        ; R4 -> x
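
The 4-bit example can be reproduced with the C sketch
below. The functions lsr4, asr4, and to_signed4 are
hypothetical helpers written for this illustration;
they model a logical shift, an arithmetic shift, and
the two's complement interpretation of a 4-bit
pattern.

```c
#include <stdint.h>

/* Logical right shift of a 4-bit pattern: a zero enters at the MSB. */
uint8_t lsr4(uint8_t bits) {
    return (uint8_t)((bits & 0x0Fu) >> 1);
}

/* Arithmetic right shift of a 4-bit pattern: bit 3 (the sign) is kept. */
uint8_t asr4(uint8_t bits) {
    uint8_t sign = (uint8_t)(bits & 0x08u);
    return (uint8_t)(((bits & 0x0Fu) >> 1) | sign);
}

/* Interpret a 4-bit pattern as a two's complement value in -8..7. */
int to_signed4(uint8_t bits) {
    bits &= 0x0Fu;
    return (bits & 0x08u) ? (int)bits - 16 : (int)bits;
}
```

For the pattern 1001 (-7), lsr4 gives 0100 (4) while
asr4 gives 1100 (-4). Note that the arithmetic shift
rounds toward negative infinity, so -7 becomes -4
rather than -3.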


Remainders 


A number divided by another yields a quotient and a
remainder. The relation is $x = q \times y + r$,
where q is the quotient and r is the remainder after
x is divided by y. Generally speaking, computing the
remainder requires an algorithm if there is no direct
instruction from the CPU, which is the case in
MSP430. However, if the divisor (y) is a power of
two, a simple logical AND operation will do the
trick. For example, the remainder of a number x
divided by 8 is calculated as follows:


        MOV x, R4        ; x -> R4
        AND #0x7, R4     ; R4 & 0x7 -> R4


The binary representation of 0x7 is 0000 0000 0000
0111, which extracts the lowest 3 bits from the
number. The lowest 3 bits (value 0 to 7) are the
remainder of the number divided by 8. The immediate
constant is always one less than the divisor, which
is a power of two.
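
The masking rule can be written as a small C function.
The name remainder_pow2 is an assumption for this
sketch, and the function is meant for unsigned values
only.

```c
#include <assert.h>
#include <stdint.h>

/* Remainder of x divided by a power-of-two divisor, using a mask. */
uint16_t remainder_pow2(uint16_t x, uint16_t divisor) {
    /* The trick is only valid when divisor is a power of two. */
    assert(divisor != 0 && (divisor & (divisor - 1)) == 0);
    return (uint16_t)(x & (divisor - 1));
}
```

For instance, remainder_pow2(29, 8) masks 29 with 0x7
and returns 5, which matches 29 = 3 * 8 + 5.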


Control Structure 


Overview: Control structures implement program
control flow such as repetitions and conditional
jumps. A piece of code in a program may be executed
repeatedly a number of times. A loop is typically
implemented for that piece of code. Some of the code
may be executed based on a condition. If the
condition holds, the code is executed. Otherwise, the
code is bypassed. In some applications, computations
are based on the user's inputs. Therefore, the code
is organized to be executed based on the
corresponding conditions derived from the user's
inputs.


If Statement 


A simple IF statement evaluates a Boolean 
expression and executes the enclosed code if the 
Boolean expression is evaluated to true. The 
semantics can be described by the following 
statement in a typical high level language. 


if (R4 == 0) { 
R5 := 1; 
} 


The value of R5 in the above statement will be 1 if
R4 is 0. If R4 is not 0, R5 keeps whatever value it
holds. The statement includes a Boolean expression,
R4 == 0, that is the condition for the following
assignment statement. If the condition is true, the
assignment statement is executed. If the condition is
false, the assignment must be skipped. A jump
instruction along with a label is used to skip
instructions. Below is the translated machine code.


        TST R4           ; test if R4 == 0
        JNZ Skip         ; jump to Skip if R4 != 0
        MOV 0(R3), R5    ; R5 := 1
Skip:


In the above example, only the true part appears. If
R5 should instead be assigned another value, say 2,
when R4 is not zero, a false part can be added. The
statement with the else-part is as follows:


if (R4 == 0) {
    R5 := 1;
}
else {
    R5 := 2;
}


Obviously, only one of the two assignments is
executed. The following machine code implements this
if-then-else structure.


_if:    TST R4           ; test if R4 == 0
        JNZ _else        ; jump to else
_then:
        MOV 0(R3), R5    ; R5 := 1
        BR  #_endif      ; jump to endif
_else:
        MOV @R3, R5      ; R5 := 2
_endif:


In the above example, the labels are prefixed with an
underscore to differentiate them from the predefined
symbols in the system. The first two labels (_if and
_then) are redundant and exist only to make the code
readable. The required labels are _else and _endif.


Switch Case Structure 


If there are several cases to consider, a structured
implementation is a switch-case structure. For
example, we may use the following switch-case
structure to count the occurrences of letters.


switch (R4) {
    case 'A':
        R5++;
        break;
    case 'B':
        R6++;
        break;
    case 'C':
        R7++;
        break;
    default:
        R8++;
}


The above switch-case statement is translated to the
following machine code.


CaseA:  CMP #'A', R4     ; test if R4 = 'A'
        JNZ CaseB        ; jump to the next case
        INC R5           ; R5 + 1 -> R5
        BR  #EndSwitch   ; jump to EndSwitch
CaseB:
        CMP #'B', R4     ; test if R4 = 'B'
        JNZ CaseC        ; jump to the next case
        INC R6           ; R6 + 1 -> R6
        BR  #EndSwitch   ; jump to EndSwitch
CaseC:
        CMP #'C', R4     ; test if R4 = 'C'
        JNZ Default      ; jump to Default
        INC R7           ; R7 + 1 -> R7
        BR  #EndSwitch   ; jump to EndSwitch
Default:
        INC R8           ; R8 + 1 -> R8
EndSwitch:


Loops 


The loop is a common control structure and is widely
used in programs. Normally, a predicate composed of
loop variables is used to determine whether the loop
should continue or terminate. The predicate is
implemented as a Boolean expression, which may be
evaluated before or after a loop iteration is
executed. The while loop evaluates the loop predicate
before a loop iteration starts. The repeat-until
structure evaluates the loop predicate at the end of
an iteration, which allows the loop to be executed at
least once. Should a loop become large, programmers
may forget to modify loop variables, which results in
an infinite loop. A more structured construct is the
for-loop, which places the primer, predicate, and
modification of loop variables upfront. It forces
programmers to implement each component of the loop
and so helps avoid mistakes.


While loop 


The basic structure of a while loop includes the
primer, the loop body, and the modification of loop
variables. The primer initializes loop variables and
evaluates the loop predicate. The loop body contains
the statements to be executed in the loop and the
modification of loop variables. The following while
loop implements the multiplication p = x * y.


x := 100;
y := 200;
p := 0;
while (y != 0) {
    p := p + x;
    y := y - 1;
}


The above algorithm is translated to the following
machine code. Assume the variables x, y, and p are
associated with the registers R5, R4, and R6,
respectively.


                         ; R6 = R5 * R4
        MOV.W #100, R5   ; loop primer
        MOV.W #200, R4   ; used as countdown counter
        MOV.W R3, R6     ; init R6 to zero
        TST R4           ; predicate evaluation
begin:  JZ done          ; conditional jump
        ADD R5, R6       ; loop statements
        DEC R4           ; modification of loop variables
        JMP begin        ; repetition
done:                    ; end of the loop


All the instructions before "begin" are the loop
primer. Two labels, "begin" and "done", are
unavoidable in the program. The "begin" label marks
the beginning of the loop whereas the "done" label
indicates the end of the loop. The predicate
evaluation "TST R4" is added before entering the loop
because the Z bit is not set by the MOV instruction.
It is executed only once, however, because further
predicate evaluation is done automatically by the
instruction "DEC R4", which also decreases the loop
variable R4 by one. If R4 reaches zero, the Z bit is
set, and it is not cleared by the "JMP begin"
instruction. So the following "JZ done" instruction
correctly terminates the loop.
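
For comparison, the same while loop can be written in
C. This is a minimal sketch with an assumed function
name, using 16-bit unsigned values to mirror the
MSP430 registers.

```c
#include <stdint.h>

/* Multiply x by y through repeated addition, as in the while loop. */
uint16_t multiply_by_addition(uint16_t x, uint16_t y) {
    uint16_t p = 0;               /* loop primer: clear the product */
    while (y != 0) {              /* predicate evaluated before each pass */
        p = (uint16_t)(p + x);    /* loop statement */
        y--;                      /* modification of the loop variable */
    }
    return p;
}
```

Calling multiply_by_addition(100, 200) returns 20000,
the same value the assembly version leaves in R6.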


Repeat-Until 


If a loop has to be executed at least once, the
repeat-until structure is used. This is especially
useful if some variables are initialized in a loop
and used for further computation after the loop. The
repeat-until construct guarantees the variable
initialization. The following example computes the
length of a null-terminated string. In IAR, the
following declaration allocates a null-terminated
string (the actual allocated space is one more than
the length of the string).


        RSEG DATA16_N
str     DB "1234"        ; define a null-terminated string
len     DB 0             ; length of string, initialized to 0


Note that the system allocates 5 bytes for the
double-quoted string "1234". If single quotes are
used instead, such as '1234', only 4 bytes are
allocated. In that case, there is no null character
at the end. Even though it saves some space, there is
no way to infer the size of the stored string.


The following algorithm computes the length of a
null-terminated string using the repeat-until
construct. What we want to find is the index in the
character string at which the null character is
located. That index is the length of the string.


y := -1;
repeat {
    y := y + 1;
} until (str[y] == 0);
len := y;


The primer of the repeat-until construct normally
sets values to be one less than the start points. The
reason is that the repeat-until changes loop
variables before the condition test. In the above
string length computation, the initial value of the
index y is set to -1, so the computation can start at
0, which is the first character of the string.
Moreover, the array access str[y] in the until
statement requires index addressing if the TST
instruction is used. Since MSP430 does not allow the
base to be a register in index addressing, i.e., the
R(R) format, the actual array address calculation
requires another register. The following translated
code implements the above algorithm.


        MOV @R3+, R4     ; set R4 to -1
        MOV #str-1, R5   ; R5 keeps address of the string minus one
repeat:
        ADD 0(R3), R4    ; R4 + 1 -> R4
        ADD 0(R3), R5    ; R5 + 1 -> R5
        TST.B 0(R5)      ; test if str[y] is null
        JZ until         ; jump to until (done)
        JMP repeat       ; repeat the loop
until:
        MOV.B R4, len    ; store the length


For-loop 


The for-loop is highly recommended for loop
implementation as it has a very good structure for
all the constituents of a loop. The initialization,
predicate, and loop variable modifications sit
together. This good programming practice helps avoid
hard-to-debug errors in loop implementation. For the
string length calculation example, the following
algorithm employs a for-loop.


for (y = 0; str[y] != 0; y++);
len = y;


The body of the above for-loop construct is empty as
most of the tasks are done in the for-loop header.
The following translated machine code for the string
length computation has the same number of
instructions (8) as the repeat-until implementation.
However, the for-loop evaluates the loop condition
first. Therefore, the initialization part is
straightforward: the index is set to 0, not -1. The
address of the string is also set to its beginning
address, not one less than that.


        MOV R3, R4       ; initialize R4 to 0
        MOV #str, R5     ; R5 keeps address of str
for:    TST.B 0(R5)      ; test if str[y] is null
        JZ endFor        ; leave the loop on null
        INC R4           ; R4 + 1 -> R4
        INC R5           ; R5 + 1 -> R5
        JMP for          ; repeat the loop
endFor:
        MOV.B R4, len    ; store the length
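
The two loop styles used for the string length can be
placed side by side in C. The function names below are
assumptions for this sketch; the do-while statement
plays the role of repeat-until.

```c
#include <stddef.h>

/* Repeat-until style: the index starts one below the first element. */
size_t length_repeat_until(const char *str) {
    size_t y = (size_t)-1;        /* primer: wraps to 0 on the first y++ */
    do {
        y++;
    } while (str[y] != 0);        /* until (str[y] == 0) */
    return y;
}

/* For-loop style: the condition is tested before each iteration. */
size_t length_for(const char *str) {
    size_t y;
    for (y = 0; str[y] != 0; y++) {
        /* empty body: all the work is in the loop header */
    }
    return y;
}
```

Both functions return 4 for the string "1234" and 0
for an empty string.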


Register usage and Loops 


Loops probably account for most of the execution time
of a program. A rule of thumb is to put the variables
involved in the body of a loop into registers if
possible, because register accesses are considerably
faster than memory accesses. Thus, the programming
paradigm is to first load variables from memory in
the primer part of a loop. Then the computation in
the loop body mainly references registers. After the
loop terminates, the variables are stored back to
memory.


If there are not enough registers (only a handful may
be available at the time of the loop execution),
registers may be pushed onto the stack before
entering the loop. After the loop terminates, the
registers are restored from the stack. By doing so,
the registers may be reused in the loop to improve
performance.


Performance 


The 80/20 rule states that 80% of a program 
execution time is spent on 20% of its code. The 20% 
of the code most likely would be loops. Therefore, 
loop performance has direct impacts on the overall 
program runtime performance. In the following 
sections, techniques in optimizing loops are 
presented. 


Test at the End 


There are two jump instructions in the example for
string length computation. These two jump
instructions may be combined into one in each loop
iteration. This one instruction less would save a lot
of time should the loop iterate hundreds of thousands
of times! Below is the optimized implementation of
the for-loop. The other loop structures are similar.


        MOV R3, R4       ; initialize R4 to 0
        MOV #str, R5     ; R5 keeps address of str
        JMP test         ; jump to test
for:
        INC R4           ; R4 + 1 -> R4
        INC R5           ; R5 + 1 -> R5
test:   TST.B 0(R5)      ; test if str[y] == null
        JNZ for          ; repeat the loop
endFor:
        MOV.B R4, len    ; assign the length


Note that the optimized code still contains 8
instructions, but the main loop (from the label for:
to the statement JNZ for) has only 4 instructions,
compared with 5 instructions in the repeat-until and
for-loop implementations.


Instruction Weights in a Loop 


Indirect addressing has a lighter instruction weight
than index addressing in MSP430. The limited
destination addressing modes force the use of index
addressing in the condition test instruction TST.B in
the above code. A further optimization is to take
advantage of the rich source addressing modes in
MSP430. The TST instruction is an emulated
instruction that actually uses the CMP instruction.
Therefore, by using the CMP instruction directly, we
may employ indirect auto-increment addressing on the
source operand for the string address update. This
increments the string address and indirectly accesses
a string character at the same time. Below is the
implementation.


        MOV R3, R4       ; initialize R4 to 0
        MOV #str, R5     ; R5 keeps address of str
        JMP test         ; jump to test
for:
        INC R4           ; R4 + 1 -> R4
test:   CMP.B @R5+, R3   ; test if str[y] == null
        JNZ for          ; repeat the loop
endFor:
        MOV.B R4, len    ; assign the length


The total instruction count is down to 7, and most
importantly, the number of instructions in the main
loop is down to only 3. It is worth mentioning that
indirect auto-increment addressing performs better
because index addressing has to fetch the base
constant in addition to the instruction fetch.


Executing the Loop Backwards 


If a loop iterates 100 times, the loop variable would
go from 0 to 99. A register is normally associated
with the loop variable. After each iteration, the
loop variable is increased by one. The loop condition
would test whether the loop variable equals 99, which
may require a subtraction and a check of whether the
result is zero.


In some CPUs, testing for zero is faster than
comparing whether two numbers are the same. Normally,
comparing two numbers requires a subtract
instruction. If two numbers are equal to each other,
subtracting one from the other results in zero.
Checking whether a number is zero is actually OR'ing
all the bits.


If a loop iterates 100 times, the loop may instead go
from 100 down to 1. The loop variable starts at 100
and decreases by one in each iteration. By the time
it reaches zero, the loop terminates. So the loop
condition checks whether the loop variable equals
zero. This zero test normally performs better than a
subtraction. Moreover, the loop condition test is
required for each iteration, so its performance has a
direct impact on the loop.


Nested Loops 


Nested loops are loops within loops. For example, in
a two-dimensional matrix multiplication, there are
three loops. Assume the matrix multiplication
$C_{m \times n} = A_{m \times r} \times B_{r \times n}$,
where the subscripts are the dimensions of each
matrix, respectively. Below is the algorithm to
compute the matrix multiplication.


for (i = 0; i < m; i++) {
    for (j = 0; j < n; j++) {
        for (k = 0; k < r; k++) {
            C[i][j] = C[i][j] + A[i][k] * B[k][j];
        }
    }
}


In the above matrix multiplication algorithm, the
initial values of C[i][j] are zero. The time
complexity is $O(n^3)$, where n is the maximal
dimension of the matrices.
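
A compact C version of the triple loop is sketched
below. It is only an illustration: the function name
matmul and the choice to pass the matrices as flat
row-major arrays of int are assumptions, and the
function zeroes each C[i][j] itself before
accumulating.

```c
#include <stddef.h>

/* C (m x n) = A (m x r) * B (r x n); all matrices are flat, row-major. */
void matmul(size_t m, size_t r, size_t n,
            const int *A, const int *B, int *C) {
    for (size_t i = 0; i < m; i++) {
        for (size_t j = 0; j < n; j++) {
            C[i * n + j] = 0;                     /* initial value is zero */
            for (size_t k = 0; k < r; k++) {
                C[i * n + j] += A[i * r + k] * B[k * n + j];
            }
        }
    }
}
```

The innermost statement runs m * n * r times, which is
where the cubic time complexity comes from.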


Timing Delay Loop 


In many applications, there is a need to slow down
program execution. For example, if an LED is
toggled every second, there is a one-second delay
between turning it on and off. There are several
ways to create this delay, such as timers and
loops. The easiest way is a delay loop, which is a
loop without any useful computation. The following
example illustrates a delay loop of about 2000
cycles.


MOV #1000, R4 
delay_loop: 
DEC R4 

JNZ delay_loop 


The loop iterates 1000 times with the value of R4
changing from 1000 to 1. The DEC instruction is an
emulated instruction for SUB.x #1,dst. The
immediate value 1 comes from the constant generator
0(R3), so the instruction takes one cycle. Another
cycle is for the instruction JNZ delay_loop.
Therefore, each loop iteration consumes two cycles,
and the whole loop spends 2000 cycles. If the
MSP430 is running at 25 MHz, one cycle lasts 40 ns,
so the 2000 cycles are equivalent to 80 µs. In this
case, a one-second delay (25,000,000 cycles) would
require setting R4 to 12500000. However, the
maximal possible value for R4 (2 bytes) is 0xFFFF,
which is 65535 in decimal. Therefore, to implement
a one-second delay, a nested loop is required. The
following nested loop implements a one-second delay
running at 25 MHz.


    MOV #190, R4 ; set outer loop count to 190
    MOV R3, R5 ; set R5 to 0
delay_loop: ; entry point for both loops
    DEC R5 ; R5-1 -> R5, inner loop counter
    JNZ delay_loop ; loop if R5 not 0
    DEC R4 ; R4-1 -> R4, outer loop counter
    JNZ delay_loop ; loop if R4 not 0


The inner loop with loop counter R5 iterates 65536
times over its two instructions. Therefore, the
initial value for the outer loop counter is set to
190, which is equal to
$\lfloor 12500000 / 65536 \rfloor$. Both loops run
backward. The total number of instruction cycles
for the nested loop is about
$190 \times 65536 \times 2 = 24{,}903{,}680$, which
is close to the CPU clock rate, 25 MHz.
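The loop-count arithmetic can be checked with a short C sketch. It uses the cost model stated in the text (one cycle for DEC and one for JNZ), which is an assumption rather than a measured figure; the function names are made up for this illustration. Unlike the rounded figure above, it also counts the two outer-loop instructions.

```c
#include <stdint.h>

/* Cycles spent by the nested delay loop, using the text's cost model:
   one cycle for DEC and one for JNZ (an assumption taken from the text,
   not measured on hardware). */
uint32_t nested_delay_cycles(uint32_t outer, uint32_t inner) {
    uint32_t inner_cycles = inner * 2u;     /* DEC R5 + JNZ per inner pass */
    return outer * (inner_cycles + 2u);     /* plus DEC R4 + JNZ per outer pass */
}

/* Largest outer count whose delay does not exceed the target cycles. */
uint32_t outer_count_for(uint32_t target_cycles, uint32_t inner) {
    return target_cycles / (inner * 2u + 2u);
}
```

Under this model, an outer count of 190 with 65536 inner iterations stays just below the 25,000,000 cycles of one second at 25 MHz.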


This type of delay is not accurate in the sense
that the actual instruction counts may differ and
may depend on the underlying assembler. An accurate
time delay may be achieved using timers and
interrupts. Applications that require accurate
timing should use the timer-based approach.


Macros 


Assembly macros are user-defined code blocks that
are substituted for every macro occurrence in a
program. Each macro is associated with a name and
its arguments. Once defined, macros are used like
assembler mnemonic instructions. The assembler
replaces macro names with their code blocks using
text substitution.


The following defines a macro to swap two numbers.
This swap uses a temporary space (R15). Thus, if
R15 is used for something else, it has to be pushed
to the stack and restored after the macro.


swap MACRO A, B ; define a macro swap with two
    mov.w A, R15 ; arguments, A and B
    mov.w B, A ; swap A and B via R15
    mov.w R15, B ; move R15 to B
    ENDM ; end of macro


Here the macro name is swap, which is used to call
this macro. The arguments A and B receive the
values passed and are used for the macro expansion.
Because both A and B are source and destination
operands for some statements in the macro, they can
only use register addressing or indexed addressing.
The assembler directives MACRO and ENDM enclose a
macro definition. Macro definitions may be placed
at the beginning of the code segment or at the end
of an assembly program (after the infinite loop
statement JMP $ and before the assembler directive
END). Note that everything after END will not be
assembled! The following calls to this macro
illustrate its uses.


    swap R4, R5 ; swap R4 and R5
    MOV #0x0202, R4 ; set R4 to 0x0202
    MOV #0x0204, R5 ; set R5 to 0x0204
    swap 0(R4), 0(R5) ; swap words at 0x0202 and 0x0204


Macro names in IAR are case sensitive by default.
So if upper-case names are preferable, they should
be defined accordingly. In the above example, the
macro swap is called by its name followed by two
operands. The operands may be registers or indexed
addressing operands. Indexed addressing allows
swapping two words in memory, as shown in the above
example.


This macro uses R15 as temporary storage during the
swap. If R15 is used for some other purpose and its
value has to be retained, R15 may be pushed to the
stack and restored after the swap is done. Here is
the code for backing up R15 on the stack and
restoring it.


swap MACRO A, B ; define a macro swap with two
    PUSH R15 ; push R15 to stack
    mov.w A, R15 ; arguments, A and B
    mov.w B, A ; swap A and B via R15
    mov.w R15, B ; move R15 to B
    POP R15 ; restore R15 from stack
    ENDM ; end of macro


Assembly macros provide a convenient way for
programmers to aggregate code for specific
computations without compromising performance.
Each referenced macro name is substituted by the
actual code. From the programmer’s perspective,
macros make programs readable and maintainable.
However, the expanded code takes space on every
use. Unless code size is a concern, using macros is
highly recommended.


Since macros are expanded on each occurrence, the
labels used in them need to be declared locally. If
the labels are not declared locally, the assembler
will flag an error on the second macro call because
the same label is defined twice. Labels inside a
macro definition may not be used as jump targets
from outside, as the macro definition is used only
by the assembler internally. There is no code space
allocated for the macro definition itself.
Therefore, it does not make any sense to jump to a
macro definition directly.


max MACRO A, B ; define a macro max with two arguments
    LOCAL L1 ; declare L1 locally
    CMP.W A, B ; compare A and B
    JNC L1 ; jump if A > B
    MOV.W B, A
L1: ENDM ; end of macro


It is good programming practice to declare labels
used in a macro as local, so that common names for
an if-then-else structure may be reused. For
example, meaningful names are _if, _then, _else,
and _endif.


Procedures and Functions 


A procedure (a.k.a. subroutine) is a function
without a return value. A function is a group of
instructions with an associated name and a set of
arguments. The group of instructions is executed
according to the set of actual arguments passed at
the time the function is called. A procedure
performs some actions, whereas a function does some
computation and returns a value to the caller. Once
a function is defined, it may be called just like a
macro. The difference is that only one copy of a
function’s code is stored in the system. During a
function call, control is transferred to the
function’s code address. At the end of a function
call, control is transferred back to the
instruction following the call in the caller. This
control flow transfer requires a special CPU
instruction, e.g., the CALL instruction in MSP430.


Functions 


A function may be defined at the beginning of a
code segment or at the end of it before the
directive END, like the places where a macro is
defined. The name of a function is the label given
in its definition. The last instruction of a
function is the RET instruction, which pops the
return address from the stack and loads it into the
PC. Once a function is defined, it may be called by
name using immediate addressing (prefixing a pound
sign # to the function name).


The following example illustrates a procedure that
delays about two machine cycles (plus the call and
return overhead). Note that the definition is
placed outside the address space for the main
program, which is itself a function to be called by
the system!


DelayTwoCycles: ; define subroutine before main
    nop ; delay one cycle
    nop ; delay one cycle
    ret ; return to next instruction after call

    call #DelayTwoCycles ; call DelayTwoCycles
    ; make sure # is there ; the target address


The function name is actually a label, which
declares the starting address of the function. The
last instruction RET takes care of the return
address by setting the next PC to it. Note that the
CALL instruction is followed by the function name
prefixed with a pound sign. The pound sign
indicates an immediate value, which is the name
(starting address) of the function.


Parameters 


For a function that implements an algorithm, it is
often necessary to pass parameters to it. The
function performs its computation based on the
actual parameters. For example, if a function
multiplies two numbers, the actual numbers are
passed at the time the function is called. The
function body performs the computation based on
formal parameters, which are placeholders for the
actual parameters. Therefore, before a function
call, the actual parameters are stored in the
placeholders. Then a control transfer to the
function is triggered. These placeholders may be
registers, memory, etc., as agreed between the
caller and the callee of a function.


Call by Value 


If the caller of a function just wants to pass a
few bytes of data, their values may be passed to
the function. If the placeholders for the values
are registers, the designated registers are set to
the values accordingly. The function then performs
its computation based on the values in the
placeholders. For example, a function may designate
R4 and R5 as call-by-value parameters, with R5 also
storing the result. The caller sets R4 and R5 to
the actual values, and expects the result in R5.
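C passes scalar arguments by value, so the same idea can be sketched in a few lines. The function name add_by_value is hypothetical and chosen only for this illustration.

```c
/* Call by value: the function receives copies of the caller's data. */
int add_by_value(int a, int b) {
    a = a + b;      /* modifies only the local copy of a */
    return a;
}
```

After a call such as add_by_value(x, y), the caller’s x is unchanged; only the returned value carries the result.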


Call by Reference 


If a parameter is an array, it is often unwise to
pass it by value, because copying the whole array
requires considerable space, and only a few of its
elements may actually be accessed in the
computation. So only the address of the array is
passed to the function. This mechanism is called
call-by-reference. Call-by-reference parameters,
however, require indirect access to the actual
values. Therefore, they typically cost a little
extra time in function execution.
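In C, passing an array hands over only its address, which makes it a convenient illustration of call-by-reference. The sketch below uses a made-up function name, scale_in_place.

```c
#include <stddef.h>

/* Call by reference: only the array's address is passed; every element
   access goes indirectly through that address. */
void scale_in_place(int *values, size_t count, int factor) {
    for (size_t i = 0; i < count; i++) {
        values[i] *= factor;   /* writes reach the caller's array */
    }
}
```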


Call by Value-Returned 


Call-by-reference parameters require indirect
references to their actual values in a function. To
eliminate this indirect access, the reference of a
parameter is passed to the function, and the
function stores a copy of its value for further
references (direct access). At the end of the
function, the value is written back to the caller
via the passed address. This method is
call-by-value-returned (a.k.a. value-result). If
the parameter is referenced quite often, this
approach is very efficient. However, if there are
only a handful of references to the parameter,
making a copy of the parameter may not be
cost-effective.
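C has no built-in value-result mode, but it can be emulated with a pointer plus an explicit copy-in and copy-out. The following is a minimal sketch under that assumption; add_n_times is a hypothetical name.

```c
/* Call by value-result, emulated in C: copy in, work on the local copy,
   copy out once at the end. */
void add_n_times(int *total, int step, int n) {
    int local = *total;          /* copy in (one indirect read) */
    for (int i = 0; i < n; i++) {
        local += step;           /* direct access to the local copy */
    }
    *total = local;              /* copy out (one indirect write) */
}
```

The loop touches only the local copy, so the indirect accesses are limited to one read and one write regardless of n.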


Call by Result 


Call-by-result is almost identical to
call-by-value-returned, except that call-by-result
does not pass data into the function. That is, the
copying upon entering the function is not
necessary. Therefore, call-by-result is more
efficient than call-by-value-returned.
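Call-by-result corresponds to an output-only parameter. In the C sketch below (the name sum_to is an assumption for the example), the callee never reads the caller’s variable and writes to it exactly once.

```c
/* Call by result, emulated in C: the out-parameter is never read,
   only written once at the end. */
void sum_to(int n, int *result) {
    int local = 0;               /* no copy-in from *result */
    for (int i = 1; i <= n; i++) {
        local += i;
    }
    *result = local;             /* copy out */
}
```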


Call by Name 


Call by name is the parameter passing mechanism
used in macros. The actual parameters are textually
substituted for the formal parameters during macro
expansion. The swap macro expansion is illustrated
as follows.


swap MACRO A, B ; define a macro swap with two
    mov.w A, R15 ; arguments, A and B
    mov.w B, A ; swap A and B via R15
    mov.w R15, B ; move R15 to B
    ENDM ; end of macro

    swap R4, R5 ; macro call with parameters R4, R5
    ; the macro call swap R4, R5 with the actual
    ; parameters expands to:
    mov.w R4, R15 ; R4 -> A, R5 -> B
    mov.w R5, R4 ; swap R4 and R5 via R15
    mov.w R15, R5 ; move R15 to R5


During the macro expansion, each actual parameter
replaces every occurrence of its corresponding
formal parameter. This text substitution may cause
problems should there be a symbol conflict with the
actual parameters. For example, if we pass R14 and
R15 to the macro, the following shows the macro
expansion.


    mov.w R14, R15 ; R14->A, R15->B
    mov.w R15, R14 ; swap A and B via R15
    mov.w R15, R15 ; move R15 to B


Obviously, the result is incorrect. After the
execution of the macro, R15 contains the value of
R14, but R14 retains its value. This is due to the
name conflict on R15. Some languages such as Scheme
provide hygienic macros, which guarantee that
macros do not collide with existing symbols. This
naming conflict cannot be solved simply by storing
R15 on the stack. In the above example, a
workaround is to eliminate the use of the temporary
register R15 as follows.


swap MACRO A, B ; define a macro swap with two
    xor.w B, A ; arguments, A and B
    xor.w A, B ; swap A and B using XOR
    xor.w B, A ;
    ENDM ; end of macro


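The swap algorithm takes advantage of the exclusive
OR and its equality axioms: $X \oplus X = 0$ and
$X \oplus 0 = X$. First, $A' = A \oplus B$. Then
$B' = B \oplus A' = B \oplus A \oplus B = A$. So at
this point, B got A. Applying the same operation
again, $A'' = A' \oplus B' = A \oplus B \oplus A = B$,
will put B in A.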
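The C preprocessor also works by text substitution, so it shows the same hazard and the same workaround. The sketch below is an analogue, not the assembler macros from the text; the macro names SWAP_TMP and SWAP_XOR are made up for the example.

```c
/* Textual substitution: the macro's temporary is named tmp, so calling
   SWAP_TMP(tmp, y) would expand to "unsigned tmp = (tmp);", where the
   new tmp hides the caller's variable (the analogue of the R15
   conflict). */
#define SWAP_TMP(a, b) \
    do { unsigned tmp = (a); (a) = (b); (b) = tmp; } while (0)

/* XOR-based swap introduces no temporary name, so nothing can collide.
   Note that it zeroes the value if a and b are the same object. */
#define SWAP_XOR(a, b) \
    do { (a) ^= (b); (b) ^= (a); (a) ^= (b); } while (0)
```

Both macros swap two distinct unsigned variables; only SWAP_XOR is free of the naming conflict.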


Parameters in Registers 


Before a function call, the caller has to prepare
the actual parameters for the function to work
with. The storage for these parameters is an
agreement between the function and its caller. If
the size of the storage is small (a few bytes), the
fastest and easiest way is to pass parameters via
registers. The following example implements a
shift-add multiplication algorithm.


mul:                       ; shift-add multiplication
                           ; R4 * R5 -> R5
        PUSH R4            ; store R4
        PUSH R6            ; store R6
        PUSH R7            ; store R7
        MOV R3, R6         ; 0 -> R6
        MOV @R3+, R7       ; 0xFFFF -> R7 (counter)
mulStart:
        CLRC               ; clear carry
        RRC R5             ; rotate right multiplier
        JNC noAdd          ; no add if LSB of R5 is 0
        ADD R4, R6         ; R4 + R6 -> R6
noAdd:  CLRC               ; clear carry
        RRC R7             ; rotate right via carry
        JNC mulEnd         ; done
        CLRC               ; clear carry
        RLC R4             ; rotate left multiplicand
        JMP mulStart       ; repeat
mulEnd: MOV R6, R5         ; ending: put product in R5
        POP R7             ; restore R7 (reverse order)
        POP R6             ; restore R6
        POP R4             ; restore R4
        RET


The shift-add algorithm performs multiplication by
right shifting the multiplier and left shifting the
multiplicand, accumulating the multiplicand
whenever the least significant bit (LSB) of the
multiplier is non-zero in an iteration. The total
number of iterations is the number of bits of the
multiplicand and the multiplier, which is 16 bits
in MSP430. The register R6 is used for the partial
product, and R7 functions as a counter for 16
iterations. The register R4 carries the
multiplicand and normally should not be changed by
the call. Because the registers (R4, R6 and R7) may
be used in the caller, they are saved on the stack
on entry to the function and restored on exit. The
value of R5 will be overwritten by the product and
thus need not be saved.


In this implementation, R4 and R5 are used to carry 
the actual parameters from a caller. Therefore, the 
caller has to store their values before making the 
function call. The following example shows how to 
make a function call with call-by-value parameters. 


MOV #10, R4 
MOV #20, R5 
CALL #mul 


The caller first sets R4 to 10 and R5 to 20, and
expects the product to be stored in R5. After the
function call is made, the values of R4 and R5 are
used to perform the multiplication. Since R4
provides the multiplicand and its value is saved to
the stack on entry to the function, it is not
modified by the function call. This is compatible
with the convention that the source operand retains
its value after the execution of an instruction in
MSP430.
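For readers who prefer a high level view, the shift-add algorithm can be sketched in C as follows. This is an illustrative sketch rather than a translation of the assembly listing: it keeps only the low 16 bits of the product, as a 16-bit register would, and the function name shift_add_mul is an assumption.

```c
#include <stdint.h>

/* Shift-add multiplication of two 16-bit values, keeping the low
   16 bits of the product (as a 16-bit register would). */
uint16_t shift_add_mul(uint16_t multiplicand, uint16_t multiplier) {
    uint16_t product = 0;                 /* partial product (R6 in the text) */
    for (int bit = 0; bit < 16; bit++) {  /* one pass per multiplier bit */
        if (multiplier & 1u) {
            product = (uint16_t)(product + multiplicand);
        }
        multiplier >>= 1;                              /* right shift multiplier */
        multiplicand = (uint16_t)(multiplicand << 1);  /* left shift multiplicand */
    }
    return product;
}
```

Calling shift_add_mul(10, 20) mirrors the caller code above, which places 10 and 20 in R4 and R5.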


Parameters in Global Variables 


Passing parameters via registers is easy, but there
are only a limited number of registers in most
CPUs. In MSP430, there are only 12 registers
available (4 others are special function
registers). An obvious alternative is the memory.
Parameters may be allocated in a global memory area
accessible to all the functions. The beginning of a
data segment, e.g., DATA16_N in MSP430, is a
candidate for the global variables. A map should be
created for the global variables stored in the data
segment. Once created, all the functions have to
follow the agreement to access the global
variables. Otherwise, an unexpected result may
occur. Global variables, though simple, are not
very efficient as each memory access is much slower
than a register access. This would be the last
resort to consider for parameter passing.


Parameters in Stack 


Most high level programming languages, such as the
C language and its derivatives, pass parameters
through the stack. The system allocates an
activation record for each function call
(instance). In the activation record, there are
actual parameters, static link, dynamic link,
return address, and local variables of the
function. Therefore, the caller of a function
pushes the actual parameters to the stack, and then
makes the function call. The following diagram
shows the content of an activation record after a
function call is made.


Table 13 A Typical Activation Record of a Function
Call in Stack


    Local variable
    ...
    Local variable
    Return address
    Dynamic link
    Static link
    Actual parameter
    ...
    Actual parameter


The actual parameters are pushed in the order shown
in Table 13. The static link shows the relation
between the function and its surrounding context.
The dynamic link indicates the activation record
that created this one at runtime. The return
address is the address of the next instruction
after the function call, which is pushed by the
CALL instruction in MSP430. Control should return
to the caller’s next instruction after the function
finishes its execution. Local variables are defined
and used in the function for computations.


The following example illustrates passing actual 
parameters through stack to a function in MSP430. 
First of all, the caller is in charge of pushing two 
actual parameters to stack, and then makes a 
function call. After the actual parameters are 
pushed, the stack contains the values for s1, and s2 
in that order. The CALL instruction will then push 
the return address to the stack (on top of s1 and s2). 


mul:                       ; shift-add multiplication
                           ; stack: s1, s2, ra
        PUSH R4            ; store R4
        PUSH R5            ; store R5
        PUSH R6            ; store R6
        PUSH R7            ; store R7
                           ; stack: s1, s2, ra, R4, R5, R6, R7
        MOV R3, R6         ; 0 -> R6
        MOV @R3+, R7       ; 0xFFFF -> R7 (counter)
        MOV 12(sp), R4     ; get s1 to R4
        MOV 10(sp), R5     ; get s2 to R5
mulStart:
        CLRC               ; clear carry
        RRC R5             ; rotate right multiplier
        JNC noAdd          ; no add if LSB of R5 is 0
        ADD R4, R6         ; R4 + R6 -> R6
noAdd:  CLRC               ; clear carry
        RRC R7             ; rotate right via carry
        JNC mulEnd         ; done
        CLRC               ; clear carry
        RLC R4             ; rotate left multiplicand
        JMP mulStart       ; repeat
mulEnd: MOV R6, 12(sp)     ; ending: put product to s1
        POP R7             ; restore R7
        POP R6             ; restore R6
        POP R5             ; restore R5
        POP R4             ; restore R4
        MOV 0(sp), 2(sp)   ; move ra to the location of s2
        INCD sp            ; rewind stack
        RET


Here is the call on the caller side. The caller and
the function assume the parameters are in the
following order: s1, s2, ra, where ra is the return
address, and s1 will hold the return value, which
is the product of s1 times s2. That is, after the
function returns, the top of the stack should be
the result.


    PUSH #10 ; push first actual parameter to stack
    PUSH #20 ; push second actual parameter
    CALL #mul ; call multiplication function
    POP R4 ; pop return value from stack


Note that the function shown in the above example
receives parameters from the stack. It first saves
the registers to be used on the stack. Then the
parameters are accessed and loaded into registers
using indexed addressing on the stack pointer (SP).
Once the parameters are set, the computation is the
same as the normal shift-add algorithm. The tricky
part is the stack manipulation at the end. The
result in R6 is first stored to its designated
location (s1) on the stack. The saved registers are
then restored. At this point, there are s1
(product), s2, and ra (return address) on the
stack. Since the RET instruction takes the return
address from the top of the stack, ra has to be
copied to the location of s2, and the stack pointer
is then moved to that location. So the net result
for the stack is one element left (the product).
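The stack protocol can be mimicked in C with a small software stack. This is only a model: the Stack type and the push, pop, and mul_on_stack helpers are hypothetical, and the return address is left out because C manages it implicitly.

```c
#include <stdint.h>

/* A tiny software stack standing in for the CPU stack (hypothetical
   helper type, not part of any MSP430 toolchain). */
typedef struct {
    uint16_t slots[16];
    int top;                      /* number of occupied slots */
} Stack;

void push(Stack *s, uint16_t v) { s->slots[s->top++] = v; }
uint16_t pop(Stack *s)          { return s->slots[--s->top]; }

/* Callee side of the protocol: on entry the stack holds s1, s2
   (s2 on top). On exit only the product remains, in s1's slot. */
void mul_on_stack(Stack *s) {
    uint16_t s2 = s->slots[s->top - 1];
    uint16_t s1 = s->slots[s->top - 2];
    s->slots[s->top - 2] = (uint16_t)((uint32_t)s1 * s2); /* result overwrites s1 */
    s->top--;                     /* drop s2: rewind the stack */
}
```

The caller pushes the two parameters, calls mul_on_stack, and pops a single value, which matches the net effect described above.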


Parameters in Code Stream 


Normally, embedded processors have a larger program
memory than data memory. Parameters may be passed
along with the program code stream, i.e., in the
code memory. Code memory may contain data such as
immediate values. Most CPUs support code memory
access, and therefore actual parameters may be
embedded in the code. The actual parameters are
defined immediately after the function call
instruction. It is also possible to place
parameters before the function call instruction,
contingent on the assembler’s support. However, the
code memory is normally read-only, so the return
value has to be stored elsewhere. It also means
that the parameters embedded in the code memory are
static and may not be run-time values. The
following example illustrates parameter passing via
code memory in Silicon Labs IDE for 8051
processors.


power:
    mov r0, sp       ; SP -> r0
    mov DPH, @r0     ; high byte of ra -> DPH
    mov a, #-1
    add a, r0
    mov r0, a        ; SP-1 -> r0
    mov DPL, @r0     ; low byte of ra -> DPL
    mov a, #0
    movc a, @a+dptr  ; get parameter at return address
    mov r0, a        ; store parameter 0xab to r0
    inc dptr         ; adjust return address
    mov r0, sp       ; SP -> r0
    mov @r0, DPH     ; store high byte of ra
    mov a, #-1
    add a, r0
    mov r0, a
    mov @r0, DPL     ; store low byte of ra
    ret


The function power is called using the following
statements. The caller defines a parameter 0xab
immediately after the function call instruction.
The data is embedded in the code memory, and its
address is the return address after the function
call instruction is executed.


    call power ; call the function power
    DB 0xab ; define a parameter 0xab


The function power receives the parameter by
referencing the return address stored on the stack.
The 8051 supports instructions such as MOVC to
access its code memory. Since it is an 8-bit CPU,
the 16-bit address has to be assembled in the data
pointer (DPTR). With the 16-bit DPTR register, the
code memory can be accessed. The parameter at the
return address is then read out.


The last part of the above example adjusts the
return address stored on the stack. Since the
current return address points to the parameter,
which is obviously not an instruction, the return
address has to be adjusted to bypass the parameter.
In this case, the actual return address is
increased by one. Again, the protocol of the
parameter map is a contract between the function
and its callers.


In the 8051 example of parameters in the code
stream, the parameter (0xab) is defined directly
after the function call instruction. The constant,
however, is determined at assembly time. What if we
want to pass a variable that holds a runtime value
to a function using this technique? A solution is
to pass references to variables defined in the
caller. The following example implements this idea
in MSP430 using IAR. First of all, two arguments
are defined as follows.


    RSEG DATA16_N
arg1 DW 0 ; parameter 1
arg2 DW 0 ; parameter 2


The variables arg1 and arg2 are declared with an
initial value 0. The caller may manipulate the two
variables at will. Note that the first two lines of
the following caller code need not be immediately
before the function call instruction. The caller
and the function (mul) assume that the function
call instruction is followed by the address of arg1
(one word), and then by the address of arg2
(another word). The location holding the address of
arg1 is actually the return address when the
function call is made. Therefore, the actual return
address (the next instruction after the function
call) should be shifted by 4 (two words, two bytes
each). The return value is assumed to be stored in
the variable arg1. Therefore, the last line loads
the result into R4 for further computation.


    MOV #10, arg1
    MOV #20, arg2
    CALL #mul ; call multiplication function
    DW arg1 ; address of arg1
    DW arg2 ; address of arg2
    MOV arg1, R4


The multiplication function is implemented with a
slight change in parameter passing. The following
code illustrates parameters in the code stream in
MSP430. It is worth mentioning that the addresses
of the parameters are passed to the function. The
function employs the return address stored on the
stack as a reference to get the parameters. Once
the addresses of the parameters are obtained, their
actual values are loaded into registers R4 and R5.
Similarly, at the end (mulEnd:), the result is
written back to the caller’s variable arg1.
Finally, the return address stored on the stack has
to be increased by 4 to bypass the two parameters.


mul:                       ; shift-add multiplication
                           ; code stream: @ra->s1, @ra+2->s2
        PUSH R4            ; store R4
        PUSH R5            ; store R5
        PUSH R6            ; store R6
        PUSH R7            ; store R7
                           ; stack: ra, R4, R5, R6, R7
        MOV 8(sp), R6      ; ra -> R6
        MOV @R3+, R7       ; 0xFFFF -> R7 (counter)
        MOV 0(R6), R4      ; get s1's address
        MOV 0(R4), R4      ; get s1's value
        MOV 2(R6), R5      ; get s2's address
        MOV 0(R5), R5      ; get s2's value
        MOV R3, R6         ; 0 -> R6
mulStart:
        CLRC               ; clear carry
        RRC R5             ; rotate right multiplier
        JNC noAdd          ; no add if LSB of R5 is 0
        ADD R4, R6         ; R4 + R6 -> R6
noAdd:  CLRC               ; clear carry
        RRC R7             ; rotate right via carry
        JNC mulEnd         ; done
        CLRC               ; clear carry
        RLC R4             ; rotate left multiplicand
        JMP mulStart       ; repeat
mulEnd: MOV 8(sp), R4      ; get ra
        MOV 0(R4), R4      ; get s1's address
        MOV R6, 0(R4)      ; ending: put product to s1
        POP R7             ; restore R7
        POP R6             ; restore R6
        POP R5             ; restore R5
        POP R4             ; restore R4
        ADD @SR, 0(SP)     ; increase ra by 4 to bypass the parameters
        RET


Parameter Block 


If there are a lot of parameters to be passed to a
function, a designated parameter block may be used.
A parameter block is a group of contiguous memory
locations for parameters. All the caller needs to
do is pass the beginning address of the parameter
block to the function. The function then references
parameters relative to the beginning address of the
parameter block. In this technique, only one
parameter (the beginning address of the parameter
block) needs to be passed to the function. However,
indirect accesses to parameters may be a little
slower than direct accesses to registers. Thus,
this technique would be the last resort if
performance is a concern.
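In C, a parameter block maps naturally onto a struct whose address is the only argument. The sketch below is illustrative; the type SumParams, its fields, and the function sum_block are assumptions made for the example.

```c
#include <stdint.h>

/* A parameter block: contiguous storage for all arguments and the
   result. The field names are illustrative, not from the text. */
typedef struct {
    int16_t a, b, c, d;   /* input parameters */
    int32_t sum;          /* result, written by the callee */
} SumParams;

/* Only the block's starting address is passed; every field is reached
   indirectly through that address. */
void sum_block(SumParams *p) {
    p->sum = (int32_t)p->a + p->b + p->c + p->d;
}
```

The same block also carries the result back, so the caller reads p.sum after the call.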


Functions Results 


Functions may return results to their callers. In
high level languages, a function that does not
return a value, e.g., a void function in C, is
called a procedure. Procedures are statements
whereas functions are expressions. However, there
is no difference between a function and a procedure
in assembly languages. Normally, a function returns
one value in a high level language. In assembly,
however, a function may return as many values as
needed. The return values may be placed in
registers, on the stack, or in memory. The
following sections detail these mechanisms.


Returning Function Results in a Register 


The fastest way to return function results is to
place them in registers. Except for the special
function registers such as PC, SP, SR, and R3 in
MSP430, all registers are eligible for return
values. The problem is that there is only a handful
of registers. It is at the programmer’s discretion
to manage the registers. A good approach is to
reserve a designated set of registers for
parameters. For example, registers R14 and R15 in
MSP430 may be reserved for parameters. Thus, the
caller would avoid using them in computation. They
are only used to receive return values from a
function. In case it is inevitable to use the
reserved registers, they have to be saved on the
stack and restored later.


Returning Function Results on the Stack 


High level programming languages normally pass
parameters through the stack for its flexibility.
Before a function call instruction, the caller has
to push the parameters to the stack, including a
slot for the return value. Normally, the first push
is for the return value, followed by the
parameters. This sequence is important as the
function references the parameters and the return
value “slot” in that order. Based on this scenario,
should a function return more than one value, the
caller just pushes more return value “slots” onto
the stack.


What is important is that the function has to
rewind the stack to a state in which only the
return value “slots” remain. The process involves
moving the return address on top of the return
value slots right before the function returns.


Returning Function Results in Memory 
Locations 


Return values may be stored in designated memory
locations. The approach is identical to passing
parameters in global variables or in a parameter
block. Global variables, once defined, are visible
to all functions. They may be involved in the
computation and may also store results. Likewise, a
parameter block is defined in a chunk of memory
(each individual parameter may not be explicitly
defined). Passing the beginning address of a
parameter block allows a function to access
parameters and store computation results in it.


Side Effects 


In a function, registers or variables involved in
its computation will be modified if they are not
saved and restored afterwards. This is called a
side effect. A program may be designed to depend on
a side effect. For example, a caller may expect the
modified value of a register after a function
returns. If the function is modified later, the
side effect may be eliminated. This will cause a
hard-to-debug error. In practice, computation that
depends on side effects should be avoided.
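A side effect can be illustrated in C with a shared global that plays the role of a register. This is a sketch under that analogy; the variable scratch and both function names are hypothetical.

```c
/* scratch plays the role of a register shared by caller and callee
   (a hypothetical global used only for this illustration). */
int scratch;

/* Side effect: leaves the doubled value behind in scratch. */
int twice_with_side_effect(int x) {
    scratch = x * 2;
    return scratch;
}

/* Same result, but scratch is saved on entry and restored on exit. */
int twice_preserving(int x) {
    int saved = scratch;
    scratch = x * 2;
    int result = scratch;
    scratch = saved;
    return result;
}
```

A caller that reads scratch after the call works with the first version and silently breaks if it is later replaced by the second, which is exactly the kind of dependency to avoid.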


Local Variable Storage 


Local variables in high level languages are called
auto variables, meaning that they are allocated on
the stack and wiped out after the function returns.
In assembly, temporary storage may likewise be
allocated on the stack. A function may issue
several pushes to allocate space on the stack. Each
slot may be accessed relative to the stack pointer
(SP). When the computation is done, a matching
number of pops has to be issued, or the stack
pointer (SP) adjusted accordingly, to rewind the
stack.


Recursion 


Recursion is a natural representation of problems
that have recursive definitions. A recursive
definition typically specifies a recursive relation
of a problem to itself in terms of its problem
size, and a base that defines the fundamental
problem. For example, the sum of n integers from 1
to n is recursively defined as follows:


$$sum(n) = \begin{cases} 0 & \text{if } n = 0 \\ n + sum(n-1) & \text{if } n > 0 \end{cases}$$


This recursive definition of the function sum is
based on itself with a problem size that is one
less (n-1). The base is required. Otherwise, the
computation would never end. The following
derivation illustrates how to compute $sum(n)$.


$$sum(n) = n + sum(n-1) = n + (n-1) + sum(n-2) = \cdots = n + (n-1) + \cdots + 1 + sum(0)$$


When the derivation reaches sum(0), we know its
value is 0 by the base definition. So its value is
substituted for $sum(0)$ in the computation of
$sum(1)$. Therefore, $sum(1) = 1 + sum(0) = 1$. The
value of $sum(1)$ is then used to compute $sum(2)$,
which is $2 + sum(1) = 3$. Doing this backwards,
eventually $sum(n)$ may be calculated.


The above recursive function may be implemented in
any programming language that supports recursion.
A C implementation is illustrated as follows.


int sum(int n) {
    if (n == 0)
        return 0;
    return n + sum(n-1);
}


Note that the base case is checked first in the C
function, and n is assumed to be a non-negative
integer (why?). The function returns a value and
takes a parameter. Typically, the return value and
the parameter are allocated on the stack.
Otherwise, the return value may be wiped out by
other function call instances. However, in this
example, the return value may be allocated using a
global variable. Anyhow, the following
implementation pushes the return value on the
stack.


The caller first allocates a return value slot and sets 
a parameter in stack. So the function will have the 
return value, followed by the parameter, and 
followed by the return address. 


        PUSH    R3          ; reserve return value
        PUSH    #4          ; actual parameter 4
        CALL    #sum        ; call recursive function sum
        POP     R4          ; pop parameter
        POP     R4          ; pop return value


The recursive function sum implemented in assembly
requires a careful arrangement of parameters and
return values on the stack. First of all, the
registers R4 and R5 are saved on the stack because
they are used in the computation in the body of the
function. The stack pointer (SP) is used to reference
the parameter and the return value. A rule of thumb
is to have exactly matching numbers of pushes and
pops when operating the stack in the code. Thus, the
stack is rewound to its previous state as if nothing
had happened. This property should be maintained
throughout the whole program. Otherwise, there is a
risk of an out of memory error, which normally
results from an unmatched number of pushes and pops
in stack operations.


sum:                            ; sum(n) = n + sum(n-1), sum(0) = 0
                                ; stack: ret_value, n, ra
        PUSH    R4              ; store R4
        PUSH    R5              ; store R5
                                ; stack: ret_value, n, ra, R4, R5
        MOV     6(SP), R4       ; get n
        TST     R4              ; test if n = 0
        JZ      zero            ; jump to base case

        MOV     R4, R5          ; keep n
        DEC     R4              ; n-1 -> R4
        PUSH    R3              ; allocate slot for ret_value
        PUSH    R4              ; push n-1
        CALL    #sum            ; recursive call to sum
        POP     R4              ; pop parameter
        POP     R4              ; get ret_value
        ADD     R4, R5          ; compute n+sum(n-1)
        MOV     R5, 8(SP)       ; store return value
        JMP     endSum

zero:   MOV     R3, 8(SP)       ; return 0 for base case
endSum: POP     R5              ; restore R5
        POP     R4              ; restore R4
        RET
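
The same stack discipline can be tried out on a host
machine. The following C sketch is a simplified model,
not MSP430 code. The array stack, the index sp, and the
helpers push, pop, sum_on_stack, and call_sum are
hypothetical names introduced only for illustration. The
return address that CALL pushes is left out of the model,
so on entry the parameter sits at the top of the stack.

```c
#include <assert.h>
#include <stddef.h>

#define STACK_WORDS 64

static int stack[STACK_WORDS];
static size_t sp = STACK_WORDS;   /* empty stack; grows toward lower indices */

static void push(int value) {
    assert(sp > 0);               /* guard against stack overflow */
    stack[--sp] = value;
}

static int pop(void) {
    assert(sp < STACK_WORDS);     /* guard against stack underflow */
    return stack[sp++];
}

/* On entry: stack[sp] holds n, stack[sp + 1] is the return value slot. */
static void sum_on_stack(void) {
    int n = stack[sp];
    if (n == 0) {
        stack[sp + 1] = 0;        /* base case: sum(0) = 0 */
        return;
    }
    push(0);                      /* reserve a return value slot */
    push(n - 1);                  /* push the parameter n-1 */
    sum_on_stack();               /* recursive call */
    (void)pop();                  /* pop the parameter */
    int partial = pop();          /* pop the return value sum(n-1) */
    stack[sp + 1] = n + partial;  /* store n + sum(n-1) in our own slot */
}

/* Caller side: the same push, call, pop sequence as the assembly caller. */
int call_sum(int n) {
    size_t sp_before = sp;
    push(0);                      /* reserve return value */
    push(n);                      /* actual parameter */
    sum_on_stack();
    (void)pop();                  /* pop parameter */
    int result = pop();           /* pop return value */
    assert(sp == sp_before);      /* pushes and pops are balanced */
    return result;
}
```

Calling call_sum(4) walks through the same frames as the
assembly version. With 64 words and two words per level,
this model only handles n from 0 to 31; a larger or
negative n trips the overflow assertion in push.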


Low Power Computing 


There are two types of low power design techniques:
dynamic frequency scaling (DFS) and dynamic
voltage scaling (DVS). The power consumption of a
device is proportional to $fV^2$, where $f$ is the
frequency and $V$ is the supply voltage. Therefore,
lowering the frequency or the voltage will reduce
the power consumption. The MSP430 is designed in a
way that allows programmers to easily change the
frequency and voltage. In a later section, an example
of using the very low frequency oscillator (VLO) in
the MSP430 as the clock source will be presented. It
provides a clock of about 12 kHz (versus 16 MHz in
the full power mode), the lowest frequency available
in the MSP430.
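
To get a feeling for the numbers, the following C sketch
evaluates the ratio of two power levels under the
assumption that power is proportional to frequency times
the square of the supply voltage. The function name
relative_dynamic_power and its parameters are hypothetical
and chosen for illustration. The model ignores static
(leakage) power, so it is only a rough estimate.

```c
#include <assert.h>

/* Ratio P_new / P_old, assuming P is proportional to f * V * V. */
double relative_dynamic_power(double f_old, double f_new,
                              double v_old, double v_new) {
    assert(f_old > 0.0 && v_old > 0.0);   /* avoid division by zero */
    double f_ratio = f_new / f_old;
    double v_ratio = v_new / v_old;
    return f_ratio * v_ratio * v_ratio;
}
```

For example, relative_dynamic_power(16e6, 12e3, 3.0, 3.0)
compares a 12 kHz clock with a 16 MHz clock at an unchanged
supply voltage and gives a ratio of about 0.00075. Halving
both the frequency and the voltage gives one eighth of the
original power.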


Low Power Modes 


The MSP430 is one of the off-the-shelf CPUs that
support low power modes. There is one active mode
along with five software selectable low power modes.
An interrupt wakes up the MSP430 from any of its low
power modes. The ISR is executed in the active mode,
and the MSP430 returns to its previous low power mode
after the ISR is finished. The following table shows
the low power modes in the MSP430.


Table 14 Low Power Modes in MSP430 


Power Mode                 Settings

Active mode (AM)           All clocks are active

Low-power mode 0 (LPM0)    CPU is disabled, ACLK and SMCLK
                           remain active, MCLK is disabled

Low-power mode 1 (LPM1)    CPU is disabled, ACLK and SMCLK
                           remain active, MCLK is disabled,
                           DCO's dc generator is disabled if
                           DCO not used in active mode

Low-power mode 2 (LPM2)    CPU is disabled, MCLK and SMCLK
                           are disabled, DCO's dc generator
                           remains enabled, ACLK remains
                           active

Low-power mode 3 (LPM3)    CPU is disabled, MCLK and SMCLK
                           are disabled, DCO's dc generator
                           is disabled, ACLK remains active

Low-power mode 4 (LPM4)    CPU is disabled, ACLK is disabled,
                           MCLK and SMCLK are disabled,
                           DCO's dc generator is disabled,
                           crystal oscillator is stopped


Entering and Returning from Low Power Modes


An interrupt wakes up the MSP430 from any low power
mode (LPM). On entering an interrupt service
routine, the following tasks are performed:


* PC and SR are stored on the stack


* CPUOFF, SCG1, and OSCOFF bits are reset
automatically


Resetting the CPUOFF, SCG1, and OSCOFF bits puts the
CPU in the active mode (full power). Therefore, by
and large, interrupt service routines are executed in
the full power mode. On exiting from an interrupt
service routine, one of the following takes place:


* Option 1: Both PC and SR are restored, so the
system returns to its previous low power mode.


* Option 2: PC is restored, but the SR saved on the
stack has been modified by the ISR, so the system
returns to a different power mode after the RETI
instruction is executed.


It is possible to modify the saved SR, which contains
the low power control bits, in such a way that the
system runs in another power mode after the
interrupt service routine has finished. The following
example illustrates how the CPU is woken up from
LPM0 by an interrupt service routine.


; This example shows how an ISR wakes up
; the MSP430 from LPM0.

; Enter LPM0 Example
        BIS     #GIE+CPUOFF,SR      ; Enter LPM0
;       ...                         ; Program stops here

; Inside ISR
; Exit LPM0 Interrupt Service Routine
        BIC     #CPUOFF,0(SP)       ; Exit LPM0 on RETI
        RETI


On entry to an ISR, PC and SR are stored on the stack
as shown in Table 15. The stack pointer (SP) points
to the saved SR. Therefore, the ISR may modify its
content at will. In the above example, the CPUOFF bit
is cleared in the saved SR. When the RETI instruction
restores SR and PC, the cleared bit takes effect,
which means the CPU continues running in the active
mode. This mechanism is useful and widely used in
modern computing system design, where a key press
wakes up the system. The key press essentially
generates an interrupt, which is then serviced by
the CPU, and the ISR modifies the power mode that
will be restored.
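
The effect of editing the saved SR can be followed in a
small host-side model. The C sketch below is an
illustration under simplifying assumptions, not a
description of the hardware. The type cpu_model and the
functions interrupt_entry, isr_exit_lpm0, and reti are
hypothetical names. The bit values are those of the MSP430
status register, and the model clears GIE, CPUOFF, OSCOFF,
and SCG1 on interrupt entry.

```c
#include <assert.h>
#include <stdint.h>

/* SR bit masks of the MSP430 status register. */
enum {
    GIE    = 0x0008,
    CPUOFF = 0x0010,
    OSCOFF = 0x0020,
    SCG0   = 0x0040,
    SCG1   = 0x0080
};

typedef struct {
    uint16_t pc;
    uint16_t sr;
    uint16_t stack[8];
    int sp;               /* index of the top of stack; 8 means empty */
} cpu_model;

/* Interrupt entry: save PC and SR, then leave the low power mode. */
void interrupt_entry(cpu_model *cpu, uint16_t isr_address) {
    assert(cpu->sp >= 2);                /* room for two words */
    cpu->stack[--cpu->sp] = cpu->pc;
    cpu->stack[--cpu->sp] = cpu->sr;     /* saved SR ends up at 0(SP) */
    cpu->sr &= (uint16_t)~(GIE | CPUOFF | OSCOFF | SCG1);
    cpu->pc = isr_address;
}

/* Equivalent of BIC #CPUOFF,0(SP): edit the saved SR, not the live one. */
void isr_exit_lpm0(cpu_model *cpu) {
    cpu->stack[cpu->sp] &= (uint16_t)~CPUOFF;
}

/* RETI: restore SR first, then PC. */
void reti(cpu_model *cpu) {
    assert(cpu->sp <= 6);                /* two words must be on the stack */
    cpu->sr = cpu->stack[cpu->sp++];
    cpu->pc = cpu->stack[cpu->sp++];
}
```

Starting with SR equal to GIE+CPUOFF (LPM0), calling
interrupt_entry, isr_exit_lpm0, and reti in that order
leaves SR equal to GIE only, so the model CPU stays awake
after the ISR. Without the call to isr_exit_lpm0, RETI
would restore CPUOFF and the CPU would go back to sleep.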


Table 15 The Stack Content on Entry of an Interrupt
Service Routine

The diagram is from Texas Instruments’ MSP430
user guide with the file name
MSP430x2xxUserGuide_slau144e.pdf.


Clocking 


The design of low power operations in the MSP430
rests on software controllable clocks. There are
user-accessible control bits for the clock signals:
the main system clock (MCLK), the sub-main system
clock (SMCLK), and the auxiliary clock (ACLK).
Built-in oscillators generate these clocks: the low
frequency/high frequency oscillator (XT1), the high
frequency oscillator (XT2), the internal very low
frequency, low power oscillator at 12 kHz (VLO), and
the internal digitally controlled oscillator (DCO).
Both XT1 and XT2 may generate clocks with
frequencies ranging from 400 kHz to 16 MHz. Figure 7
describes the basic clock system in the MSP430.


The bits SCG1, SCG0, OSCOFF, and CPUOFF in SR
control the CPU clocking. When SCG1 is set, the
sub-main clock (SMCLK) is turned off. Setting SCG0
turns off the DC generator of the DCO. If OSCOFF is
set, the LFXT1 oscillator is turned off. Setting the
CPUOFF bit turns off the main system clock (MCLK).
There are frequency divisors (1/2/4/8) for ACLK,
MCLK, and SMCLK. ACLK is software selectable from
XT1 and VLO and is used for individual peripheral
modules. MCLK is software selectable from XT1, XT2
(if available), VLO, or DCO.


Figure 8 The Basic Clock System in MSP430


The control bits for the very low frequency
oscillator (VLO) are stored in the BCSCTL3 register,
an SFR located at address 0x0053. LFXT1S_2 is a
constant (0x20) defined in the header file. The
symbolic name should be used instead of the actual
value to improve program readability. So all we need
to do is set the LFXT1S_2 bit in BCSCTL3. Since there
is no need to change the other bits in BCSCTL3, the
BIS instruction fits this purpose well. LFXT1S_2 is
used as an immediate constant, and the SFR is
accessed with absolute addressing. The statement
that selects the VLO is as follows:


        BIS.B   #LFXT1S_2, &BCSCTL3
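
The reason for preferring BIS over a plain MOV can be seen
in a host-side sketch of the two bit instructions. The C
functions bis_b and bic_b below are hypothetical helpers
that mimic the byte forms of BIS and BIC on an ordinary
variable. They are an illustration, not part of any MSP430
header. The value 0x20 for LFXT1S_2 is the one quoted in
the text.

```c
#include <assert.h>
#include <stdint.h>

#define LFXT1S_2 0x20u   /* value quoted in the text */

/* BIS.B src,dst : dst = dst | src (set the bits given in src). */
uint8_t bis_b(uint8_t src, uint8_t dst) {
    uint8_t result = (uint8_t)(dst | src);
    /* bits outside src are left untouched */
    assert((result & (uint8_t)~src) == (dst & (uint8_t)~src));
    return result;
}

/* BIC.B src,dst : dst = dst & ~src (clear the bits given in src). */
uint8_t bic_b(uint8_t src, uint8_t dst) {
    return (uint8_t)(dst & (uint8_t)~src);
}
```

If the register held 0x05 beforehand, bis_b(LFXT1S_2, 0x05)
yields 0x25, so the previously set bits survive. A MOV of
0x20 would have overwritten them. Note that BIS only sets
bits. If other bits of the same field were already set,
they would have to be cleared with BIC first.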


A Low Power Design Example 


The MSP430 provides a very simple interface to
control its low power modes. The rule of thumb in
low power design is this: turn on components only
while they are needed. Most of the time, the system
should be powered at its minimal energy consumption
level, just enough to maintain its state. This is
especially crucial when a system is powered by a
battery pack.


In the following example, the low power VLO is used
to clock the whole system. The example exercises low
power mode 3 (LPM3), in which the CPU, MCLK, SMCLK,
and DCO are disabled. Only ACLK is active. To stop
the DCO, we need to set SCG0 and SCG1. Referring to
the clock diagram in Figure 7, setting SCG0 stops the
DC generator, and setting SCG1 turns off SMCLK and
disconnects it from the DCO. Both SCG0 and SCG1 are
in SR, so we may use BIS to set them. Since SR is a
word, the BIS.W instruction is used. The two bits are
set with a single instruction by using the combined
operand SCG0+SCG1.


#include "msp430.h"             ; #define controlled include file
        NAME    main            ; module name
        PUBLIC  main            ; make the main label visible
                                ; outside this module
        ORG     0FFFEh
        DC16    init            ; set reset vector to 'init' label
        RSEG    CSTACK          ; pre-declaration of segment
        RSEG    CODE            ; place program in 'CODE' segment
init:   MOV     #SFE(CSTACK), SP ; set up stack
main:   NOP                     ; main program
        MOV.W   #WDTPW+WDTHOLD, &WDTCTL ; stop watchdog timer

        BIS.B   #LFXT1S_2, &BCSCTL3 ; select VLO to source ACLK
        BIC.B   #OFIFG, &IFG1   ; clear OFIFG
        BIS.W   #(SCG0+SCG1), SR ; turn off DCO and SMCLK
        BIS.B   #SELM_3, &BCSCTL2 ; select VLO to source MCLK
        BIS.W   #CPUOFF+GIE, SR ; turn off MCLK
        JMP     $               ; jump to current location '$'
                                ; (endless loop)
        END


After selecting the VLO, we have to clear the
oscillator fault flag in IFG1, interrupt flag
register 1, at address 0x0002. The flag is defined as
OFIFG in the header file, so we may use BIC to clear
it. According to the MSP430 documentation, this step
is required. Otherwise, the system will not work.
Therefore, the BIC statement that clears the
oscillator fault flag is added.


Note that the selected VLO would have to be routed 
to the MCLK. Otherwise, the CPU is not running at 
all because there is no clock into it. That is why the 
following statement is used to select the source of 
the MCLK from VLO. 


        BIS.B   #SELM_3, &BCSCTL2   ; select VLO to source MCLK


The statement that turns off MCLK, i.e., turns off
the CPU, should be placed after the instructions for
the other settings, such as selecting the VLO. Be
aware that the system goes to “sleep” once the CPU is
turned off. Also, the global interrupt enable bit
should be set before or while putting the CPU into a
low power mode. Otherwise, the CPU will never be
woken up, because it will not receive the interrupt
that reactivates it.


Interrupts 


Interrupts are a mechanism to notify the CPU when
urgent events occur. For example, when a user
presses a key on a keyboard, the keyboard hardware
generates an interrupt to notify the CPU of this
event. The CPU will 1) stop the current execution,
2) read the user’s input, and 3) resume its execution.


Interrupt Classification 


Interrupts are classified into the following types:
system reset, non-maskable interrupt, and maskable
interrupt. Not all interrupts have the same
priority. The system reset has the highest priority,
meaning that if there are pending interrupts, the
system reset is the first one to be delivered.
Some interrupts may be masked (blocked), meaning
that they will be ignored. Others, such as an
oscillator fault or an access violation to flash
memory, may not be masked or ignored and are
therefore non-maskable interrupts. Maskable
interrupts, such as those from peripheral I/O like
general purpose I/O (GPIO), are controlled by the
global interrupt enable (GIE) bit and by individual
interrupt enable (IE) bits.


Interrupt Service Routine and Interrupt Vector


When the CPU is aware of an interrupt that has to be
delivered, a stored routine is executed. This
stored routine is called an interrupt service routine
(ISR). ISRs are programmed to suit an application’s
needs. The design of an ISR is identical to that of a
subroutine, except that the return instruction must
be “RETI,” not “RET.” ISRs may be stored anywhere in
the code memory, but their beginning addresses are
stored in a designated location called the interrupt
vector.


The interrupt and reset vectors handle interrupt
requests and system reset. In the MSP430, one
interrupt vector requires one word. The assigned
address space is from 0xFFC0 to 0xFFFF, that is, 64
bytes, so 32 vectors are available in total. Each of
the vectors stores the starting address of the
corresponding interrupt service routine (ISR). For
example, when there is an I/O interrupt request such
as I/O data ready, the CPU looks up the
corresponding interrupt vector and serves the I/O
request by running its ISR. Interrupts provide high
performance I/O and are the fundamental mechanism
for a multiprocessing system.
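
The arithmetic behind the vector table can be written down
in a few lines of C. This is a sketch for illustration. The
macro and function names (VECTOR_BASE, VECTOR_TOP,
VECTOR_BYTES, vector_count, vector_address) are hypothetical
and only encode the address range and the one-word vector
size stated in the text.

```c
#include <assert.h>
#include <stdint.h>

#define VECTOR_BASE  0xFFC0u   /* lowest address of the vector table */
#define VECTOR_TOP   0xFFFFu   /* highest address of the vector table */
#define VECTOR_BYTES 2u        /* one 16-bit word per vector */

/* Number of vectors that fit in the table. */
unsigned vector_count(void) {
    return (VECTOR_TOP - VECTOR_BASE + 1u) / VECTOR_BYTES;
}

/* Address of vector number `index`, counting from the lowest address. */
uint16_t vector_address(unsigned index) {
    assert(index < vector_count());
    return (uint16_t)(VECTOR_BASE + index * VECTOR_BYTES);
}
```

The table holds vector_count() = 32 entries. The last
entry, vector_address(31), lies at 0xFFFE, which is the
address the example programs in this chapter use for the
reset vector.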


A vector is programmed by the user with the
address of the corresponding interrupt service
routine. It is recommended to provide an interrupt
service routine for each interrupt vector that is
assigned to a module. A dummy interrupt service
routine can consist of just the RETI instruction, and
several interrupt vectors can point to it. If an
interrupt vector contains an erroneous value due to
an inappropriate initialization, the system may
behave in an “unstable” way. Because the interrupt
may occur unexpectedly, the “odd” behavior is not
predictable either. Therefore, this type of error is
really hard to track down. It is highly recommended
that the interrupt vectors be initialized during the
system reset.


Interrupt Acceptance


Interrupts notify the CPU of external events, and the
CPU responds immediately. “Immediately” does not
mean that the CPU stops execution in the same clock
cycle in which the interrupt is generated. If the
interrupted instruction requires several clock cycles
for its execution, the CPU finishes this instruction
first. In other words, an interrupt is delivered in
between normal instructions. Once an interrupt is
accepted, CPUs normally perform a context switch,
and the corresponding ISR is loaded for servicing the
interrupt request. Different CPUs may react to an
interrupt slightly differently. The following steps
illustrate how the MSP430 responds to an interrupt.


* Any currently executing instruction is
completed.


* The PC, which points to the next instruction, is
pushed onto the stack.


* The SR is pushed onto the stack. So the current
context is saved.


* The interrupt with the highest priority is
selected if multiple interrupts occurred during
the last instruction and are pending for service.


* The interrupt request flag resets automatically
on single-source flags. Multiple source flags
remain set for servicing by software.


* The SR is cleared. This terminates any low-
power mode. Because the GIE bit is cleared,
further interrupts are disabled. This is how the
MSP430 is woken up from low power modes.


* The content of the interrupt vector is loaded
into the PC. The program continues with the
interrupt service routine at that address.


* On finishing execution of the ISR, the last
instruction of the ISR, i.e., RETI, restores SR
and then PC. So the program continues its
execution as if there had been no interrupt.


1. Interrupt Processing 


When an interrupt is delivered, the hardware pushes
PC and SR onto the stack. The PC is then loaded with
the content of the corresponding vector, which is
the beginning address of the ISR. The actual ISR
code may be stored elsewhere; the assembly
programmer sets the interrupt vector to point to the
ISR that has been written for that interrupt. By and
large, an ISR should not be too long.


Return from Interrupts 


ISRs are similar to subroutines function-wise.
Subroutines are called by the program whereas ISRs
are called by the system. In the MSP430, the CALL
instruction for subroutines does not save SR. Only
the address of the next instruction (PC) is saved on
the stack. On return, the RET instruction for
subroutines only restores the saved PC from the
stack. ISRs, however, have to restore both SR and PC.
This is the reason why the return instruction for
ISRs has to be RETI (return from an interrupt service
routine), not RET.


In MSP430, the RETI instruction takes 5 cycles 
(CPU) or 3 cycles (MSP430x) to perform the 
following actions: 


* The SR with all previous settings pops from the 
stack. All previous settings of GIE, CPUOFF, 
etc. are now in effect, regardless of the settings 
used during an interrupt service routine. 


* The PC is restored from the stack.


* The program resumes execution at the point 
where it was interrupted. 


Interrupts in Interrupts 


It is possible to interrupt an interrupt that is
currently being serviced. That is, an interrupt is
requested while an ISR is running, resulting in a
nested interrupt. Since the context is saved for each
interrupted program, interrupting an interrupt is
like calling a recursive function. This may consume
quite a lot of memory. In systems where memory is not
abundant, this type of nesting should be avoided.


Interrupt nesting in MSP430 is permitted. Interrupt 
nesting is enabled if the GIE bit is set inside an 
interrupt service routine. When interrupt nesting is 
enabled, any interrupt occurring during an interrupt 
service routine will interrupt the routine, regardless 
of the interrupt priorities. 


Caveats in Using Interrupts 


Interrupts are powerful and important to computer
systems. Almost all CPUs come with interrupts. They
are simple to integrate into applications. However,
they may be a source of bugs that are fairly hard to
find in systems that do not provide a debugging
interface for interrupts. IAR provides an interface
via JTAG to debug programs, and a break point may be
set inside an ISR. Nevertheless, there are caveats in
using interrupts, as follows.


* Keep interrupt service routines short. A lengthy
ISR may cause problems. For example, the
responsiveness of the main program may be impeded,
because running an ISR means the main program is
suspended. A lengthy ISR may also block further
urgent interrupts if nested interrupts are not
allowed. If nested interrupts are allowed, however,
memory space may become a concern. Therefore, the
shorter the ISR, the better. If there is a need for
some lengthy computation, the ISR may set a flag to
notify the main program to carry out the further
processing.


* Define all interrupt vectors. Not all interrupts
are used in an application. For unused interrupts,
the vectors are still reserved by default. This is
much like a global variable that is defined for the
system but never initialized. When the system later
uses the variable in some computation, the result is
obviously not predictable. An interrupt vector
contains the address of an ISR. If its content is not
initialized to the valid address of an ISR, the
system will execute something unexpected should the
corresponding interrupt be delivered. In most cases,
an invalid address is fetched and the system halts.


* The shared data problem. Interrupts are the
foundation of multiprocessing. In the basic form,
where there is a main program with an ISR, there are
in effect two execution threads: the main program
and the ISR. If the two threads share an object, the
object has to be protected. Otherwise, data
integrity would be violated. For example, if the
shared object is a four byte integer (two words in
the MSP430), each operation on the integer should be
applied to all four bytes. Suppose, however, that
the main program has accessed the low word when an
interrupt occurs, and the ISR then modifies the
integer. By the time the main program resumes, the
high word it reads is combined with the unmodified
low word. Obviously, the result is neither the
integer as it used to be nor the one after the
modification. To solve this problem, a mechanism
similar to semaphores should be implemented.
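
The shared data problem can be reproduced on a host
machine with a small simulation. The C sketch below is an
illustration only. The type shared32 and the functions
isr_write, torn_read, and guarded_read are hypothetical
names, and the "interrupt" is simply a function call placed
at a chosen point. On real hardware, the protected variant
would correspond to reading both words while interrupts are
disabled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A 32-bit value stored as two 16-bit words, as on the MSP430. */
typedef struct {
    uint16_t low;
    uint16_t high;
} shared32;

/* Stand-in for an ISR that overwrites the shared value. */
static void isr_write(shared32 *s, uint32_t value) {
    s->low  = (uint16_t)(value & 0xFFFFu);
    s->high = (uint16_t)(value >> 16);
}

/* Unprotected read: the "interrupt" lands between the two word reads. */
uint32_t torn_read(shared32 *s, uint32_t isr_value) {
    assert(s != NULL);
    uint16_t low = s->low;          /* main program reads the low word  */
    isr_write(s, isr_value);        /* ISR modifies the integer here    */
    uint16_t high = s->high;        /* main program reads the high word */
    return ((uint32_t)high << 16) | low;
}

/* Protected read: the interrupt is held off until both words are read. */
uint32_t guarded_read(shared32 *s, uint32_t isr_value) {
    assert(s != NULL);
    uint16_t low = s->low;
    uint16_t high = s->high;
    isr_write(s, isr_value);        /* pending interrupt is serviced now */
    return ((uint32_t)high << 16) | low;
}
```

With the shared value at 0x0000FFFF and the ISR writing
0x00010000, torn_read returns 0x0001FFFF, which is neither
the old nor the new value. guarded_read returns the old
value 0x0000FFFF, and the new value is seen intact on the
next read.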


Timers 


Timers in computer systems play an important role
in providing timing information such as wall clock
time, unavoidable delays, time sharing control,
synchronization between two processes, and time
stamps in documents. Most microprocessors are
equipped with timers. There is a variety of timers.


* Watchdog: protect system against failure


* Basic timer: interval timer for LCD


* Real-time timer: provide wall clock


* Timer_A: generate interrupts, external inputs,
drive outputs, sampling, etc.


* Timer_B: similar to Timer_A


Timer_A is identical across the whole MSP430 family.
That means if a program is built on top of timers and
is to be developed for running on all MSP430 devices,
Timer_A should be selected. Otherwise, some devices
may not be equipped with the timer used, and the
program may not run on those particular systems.


Timer A 


The Timer A in the MSP430 is composed of the
following two main components:


* Timer block: A 16-bit register, TAR, is in charge
of counting.


* Capture/compare channels: Each channel has a
register TACCRn that keeps a captured or preset
count value. A channel can:


  * Capture: record the time (TAR) at which the
  input changes in TACCRn


  * Compare: compare the current value (TAR) with
  TACCRn and update the output


  * Request an interrupt by setting CCIFG in
  TACCTLn


  * Sample an input at a compare event


A control register, TACTL, is used to change the
behavior of Timer A. The value stored in the
register TACCR0 is the value that TAR is compared
with. TAR is basically a counting register. On each
rising edge of the clock, it increases its value by
one. So if we want the timer to count up to 100,
TACCR0 is set to 100. TAR counts 0, 1, 2, 3, and so
forth. On each count, the capture/compare channel
compares the values in TACCR0 and TAR. If they are
equal, the timer generates a timer interrupt.


Clock Sources 


The counter TAR in Timer A requires a clock
source, which may be selected from ACLK or
SMCLK. An external clock source may be fed in via
TACLK or INCLK. The clock may be further divided by
2, 4, or 8 via the IDx bits in the TACTL control
register.


Timer Operations 


When the clock source is active and the mode
control MCx > 0, TAR starts counting. The MCx bits
in TACTL select one of the four operating modes
listed as follows.


Table 16 Timer Operating Mode in MSP430 


MCx   Mode         Description

00    Stop         The timer is halted.

01    Up           The timer repeatedly counts from
                   zero to the value of TACCR0.

10    Continuous   The timer repeatedly counts from
                   zero to 0FFFFh.

11    Up/down      The timer repeatedly counts from
                   zero up to the value of TACCR0 and
                   back down to zero.


In the UP mode, the timer starts counting from 0.
When TAR reaches TACCR0, TAR is set to 0 on the next
count and starts over again. Meanwhile, a timer
interrupt is generated. Therefore, the period of the
UP mode is equal to TACCR0 + 1 counts. If TAR holds a
value larger than TACCR0 when the UP mode is
selected, TAR restarts counting from 0.
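
The period of the UP mode can be checked with a small
host-side simulation. The C sketch below is a simplified
model of the counter and not a cycle-accurate description
of Timer A. The function name up_mode_events is
hypothetical. It assumes that TAR starts at 0, that TAR
rolls over to 0 on the clock edge after it equals TACCR0,
and that one compare event is raised each time TAR reaches
TACCR0.

```c
#include <assert.h>
#include <stdint.h>

/* Count compare events seen in `ticks` clock edges, starting from TAR = 0. */
unsigned long up_mode_events(uint16_t taccr0, unsigned long ticks) {
    assert(taccr0 > 0);           /* a zero compare value is not modelled */
    uint16_t tar = 0;
    unsigned long events = 0;
    for (unsigned long t = 0; t < ticks; t++) {
        if (tar == taccr0) {
            tar = 0;              /* roll over from TACCR0 to 0 */
        } else {
            tar++;                /* count up on each clock edge */
        }
        if (tar == taccr0) {
            events++;             /* TAR reached TACCR0 */
        }
    }
    return events;
}
```

With TACCR0 = 3, the events fall on clock edges 3, 7, and
11, that is, every TACCR0 + 1 = 4 edges. With TACCR0 =
12000 and a clock of about 12 kHz, one event occurs roughly
once per second.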


The continuous mode provides a mechanism to
implement an interval timer. In this mode, TAR keeps
counting from 0 to 0xFFFF and then starts over from
0. The capture/compare registers TACCR0, TACCR1,
and TACCR2 store the count value at which each
period ends. For example, if an interval timer of
100 cycles is implemented, TACCR0 is set to 100 for
the first period. The timer counts from 0 to 100, and
an interrupt is generated at count 100. In the
interrupt service routine, 100 is added to TACCR0,
giving 200, i.e., the total count for the first and
the second periods. In the second period, the timer
counts 101, 102, ..., 200. An interrupt is generated
at count 200. By doing this, an interval timer of 100
clock cycles is created. Similarly, TACCR1 and
TACCR2 may be used for two other independent
intervals.
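
The update performed in the interrupt service routine is a
16-bit addition, so it wraps around together with TAR. The
following C sketch illustrates this. The function name
next_compare is hypothetical, and the code models only the
arithmetic, not the timer hardware.

```c
#include <assert.h>
#include <stdint.h>

/* Next compare value for an interval timer in continuous mode. */
uint16_t next_compare(uint16_t taccr, uint16_t interval) {
    assert(interval > 0);
    return (uint16_t)(taccr + interval);   /* wraps modulo 65536, like TAR */
}
```

Starting from 100 with an interval of 100, the next compare
value is 200, as in the text. Near the top of the range the
value wraps: next_compare(65500, 100) gives 64, which TAR
reaches 100 counts later, after it has rolled over from
0xFFFF to 0.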


In the Up/Down mode, the timer counts from 0 up to
TACCR0 and back down to 0. The interrupt flags
TACCR0 CCIFG and TAIFG are set at different points.
The TACCR0 CCIFG interrupt flag is set when the
timer counts from TACCR0 - 1 to TACCR0, whereas
TAIFG is set when the timer finishes counting down
from 0x0001 to 0x0000.


A Timer Example 


A timer is typically used with an interrupt service
routine. The timer is in charge of timing whereas
the interrupt service routine implements the system
behavior for the event. The following example
illustrates the use of a timer operating in the UP
mode with an interrupt generated about once per
second. In the interrupt service routine, an output
port is toggled, which in turn toggles a connected
LED once per second.


#include "msp430.h"             ; #define controlled include file
        NAME    main            ; module name
        PUBLIC  main            ; make the main label visible
                                ; outside this module
        ORG     0FFFEh
        DC16    init            ; set reset vector to 'init' label
        RSEG    CSTACK          ; pre-declaration of segment
        RSEG    CODE            ; place program in 'CODE' segment
init:   MOV     #SFE(CSTACK), SP ; set up stack
main:   NOP                     ; main program

        MOV.W   #WDTPW+WDTHOLD, &WDTCTL ; stop watchdog timer
        BIS.B   #1, &P1DIR      ; set P1.0 to output mode
        BIS.B   #LFXT1S_2, &BCSCTL3 ; select VLO to source ACLK
        BIC.B   #OFIFG, &IFG1   ; clear OFIFG
        BIS.W   #(SCG0+SCG1), SR ; turn off DCO and SMCLK
        BIS.B   #SELM_3, &BCSCTL2 ; select VLO to source MCLK

        MOV.W   #CCIE, &TACCTL0 ; enable TA0 interrupt
        MOV.W   #12000, &TACCR0 ; set timing count
        MOV.W   #TASSEL_1+MC_1, &TACTL ; select ACLK as clock,
                                ; UP mode
        BIS.W   #CPUOFF+GIE, SR ; turn off MCLK
        JMP     $               ; jump to current location '$'
                                ; (endless loop)

TA0_ISR:                        ; interrupt service routine for timer
        XOR.B   #001h, &P1OUT   ; toggle LED on P1.0
        RETI

        COMMON  INTVEC          ; specify interrupt vector segment
        ORG     TIMERA0_VECTOR  ; set PLC to TA's interrupt vector
        DW      TA0_ISR         ; assign ISR address

        END


The interrupt service routine (TA0_ISR) is composed
of two instructions: XOR and RETI. The XOR
instruction toggles the LED connected to the first
bit of port 1 on the EZ430-F2013 USB dongle. The RETI
instruction is used to return from the ISR. Most
importantly, the ISR has to be registered in the
interrupt vector of the timer. The following lines,
extracted from the above code, register the ISR.


        COMMON  INTVEC          ; specify interrupt vector segment
        ORG     TIMERA0_VECTOR  ; set PLC to TA's interrupt vector
        DW      TA0_ISR         ; assign ISR address


Memory System
1. Introduction 


After the emergence of stored program computer
architectures, memory has played an important role
in a computer system, not just keeping programs in
the system but also allowing larger programs to be
executed effectively. Other program data such as
global variables, temporary auto variables, and
dynamic objects are all stored in memory. A good
memory system has a great impact on the overall
system performance.


In this chapter, we will introduce the memory
system in a typical contemporary computing
platform. Topics include storage systems and their
technology (semiconductor, magnetic), storage
standards (CD-ROM, DVD, and Blu-Ray), memory
hierarchy, latency and throughput, and cache
memories (operating principles, replacement
policies, multilevel cache, cache coherency). After
reading this chapter, readers will be able to
identify the memory technologies found in a
computer and be aware of the way in which memory
technology is changing, appreciate the need for
storage standards for complex data storage
mechanisms such as DVD and Blu-Ray, understand why
a memory hierarchy is necessary to reduce the
effective memory latency, appreciate that most data
on the memory bus is cache refill traffic, describe
the various ways of organizing cache memory and
appreciate the cost-performance tradeoffs of each
arrangement, and appreciate the need for cache
coherency in multiprocessor systems.


Storage Systems and Their Technology 
1. Primary Storage 


Before a program may be executed, it has to be
loaded into the main memory, which is directly
accessed by CPUs and thus called primary storage.
The primary storage also includes registers and
cache memories. Components in the primary storage
are made of semiconductors. The main memory is
built from dynamic random access memory
(DRAM), which is cheaper than static random access
memory (SRAM). DRAM uses charges in millions of
capacitors within a very large scale integrated
circuit (VLSI) to keep digital information. Each
capacitor is either charged or not, corresponding to
logical high and logical low. However, the charge on
a capacitor fades unless the capacitor is refreshed
periodically. This is why “dynamic” is used in its
name. On the other hand, SRAM uses D-type flip-
flops, which do not need to be refreshed, and
therefore the access time of SRAM is shorter than
that of DRAM. Each bit in DRAM requires one
transistor and one capacitor whereas each bit in
SRAM requires four or six transistors. Thus, the
capacity of DRAM is much higher than that of SRAM.
Both DRAM and SRAM are volatile memory because they
lose data when the power is off. In a CPU, SRAM is
built into the same chip and used for cache
memories. Figure 1 shows the hierarchy of storage in
a computer system.


[Figure: primary storage (registers, cache (SRAM),
main memory (DRAM)) connected through I/O channels
(IDE/SATA/USB) to secondary storage (hard drives)
and offline storage (CD/DVD/Blu-Ray)]


Figure 1 Hierarchy of Storage in Computer Systems


1. Secondary Storage


Storage to which the CPU has no direct access is
classified as secondary storage, including hard
drives, floppy diskettes, etc. Floppy diskettes (8",
5.25" and 3.5") are seldom used nowadays because of
their low capacity and low speed. Floppy diskettes
use a thin flexible magnetic storage medium to keep
information. Their typical capacities are 360 KB,
720 KB, and 1.44 MB, with a transfer rate up to
1000 Kb/s. The latest development in floppy
technology is Iomega’s LS-240, which employs a laser
to guide a tiny magnetic head in order to increase
its capacity to 240 MB. However, the falling cost of
CD-ROMs and flash drives made Iomega’s floppy
diskettes obsolete around 2000.


1. Hard Disk Drives 


Hard drives are the major components used in
secondary storage. Since this storage is secondary,
infrequently used data or users’ backup files are
stored there. When the CPU needs data from the
secondary storage, they have to be loaded into the
primary storage. In terms of speed, the secondary
storage is on the order of 1000 times slower than the
primary storage. However, its space can be extremely
large. The capacity of hard drives reached 4
terabytes in 2012.


A hard disk drive (HDD) is composed of one or more
rotating “hard” discs coated with magnetic material,
spinning at high speeds (4,800 to 15,000 RPM,
typically 7,200 RPM), and magnetic read/write
heads. Data are stored in the magnetic material and
accessed by the read/write heads. Hard disk
drives were first developed by IBM for a real-time
transaction processing system running on general
purpose mainframe and mini computers in 1956.
The 350 RAMAC, IBM’s first hard disk drive, was
about the size of two refrigerators, with 50 discs
and a capacity of 5 million 6-bit characters
(3.75 MB). Form factors of hard disk drives include
8, 5.25, 3.5, 2.5, 1.8, 1, and 0.85 inches. Most
desktop computers use 3.5” hard disk drives whereas
most laptop computers use 2.5”. Figure 2 illustrates
an example of the organization of discs inside a hard
disk drive. In this example, there are 4 discs stacked
vertically with 8 surfaces in total. Each surface may
be used to store data, accessed via the read/write
heads. The read/write heads are moved by an
electrically controlled arm. Each disc is subdivided
into tracks, and each track is further divided into
sectors. Several sectors form a cluster, which is the
basic allocation unit for storing files. It is worth
mentioning that all discs are attached to the same
axle and spin together, and all read/write heads
move together. Since the time for moving the read/
write heads is much longer than the data transfer
time, it is beneficial to carefully arrange the way
data are stored among the surfaces. High performance
is achieved when the read/write head movement is
minimized.


[Figure: stacked disc surfaces (Surface 2 to
Surface 7 labeled), one read/write head per
surface, with the direction of arm movement
indicated]


Figure 2 An Example of the Organization of Discs
inside a Hard Disk Drive


Organization of tracks and sectors (blocks) is
depicted in Figure 3. The tracks are numbered from 0
outwards. Each track is further divided into equal
sized sectors. The typical size of a sector is 512
bytes. A collection of contiguous sectors forms a
cluster, which is the basic allocation unit the O.S.
uses when requesting space for a file. A typical size
for a cluster is 1 to 32 KB. For example, a cluster in
FAT 32 can be as large as 32 KB. Larger clusters waste
more disk space, because the last cluster of a file is
only 50% utilized on average; a 1 KB file stored in a
32 KB cluster, for instance, leaves 31 KB unused. If
all files on a disk are small, large clusters make
things worse. As with other resource management, the
O.S. keeps the allocation status of the sectors. There
are two ways to do so: a free list and a bit map. The
free list uses a three-tuple (track number, sector
number, number of free sectors) to record a hole, i.e.,
a run of contiguous free sectors. The bit map approach
uses a bit vector to keep track of the free sectors of
each track. A free sector has its corresponding bit
cleared in the bit vector.
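The two bookkeeping schemes can be sketched in a few lines of Python. This is only an illustration under simple assumptions, not how any particular operating system implements them: the class name `TrackBitmap` and its methods are hypothetical, one bit vector is kept per track, and a cleared bit means "free" as described above. The `holes` method derives the free-list tuples from the bit map.

```python
class TrackBitmap:
    """Allocation status of the sectors on one track (hypothetical helper)."""

    def __init__(self, sectors_per_track):
        self.size = sectors_per_track
        self.bits = 0  # all bits cleared: every sector starts free

    def allocate(self, sector):
        self.bits |= 1 << sector       # set bit: sector is in use

    def release(self, sector):
        self.bits &= ~(1 << sector)    # clear bit: sector is free again

    def is_free(self, sector):
        return (self.bits >> sector) & 1 == 0

    def holes(self, track):
        """Return free-list tuples (track, first free sector, free count)."""
        result, start = [], None
        for s in range(self.size + 1):          # one extra step closes a hole
            free = s < self.size and self.is_free(s)
            if free and start is None:
                start = s                        # a hole begins here
            elif not free and start is not None:
                result.append((track, start, s - start))
                start = None
        return result


bitmap = TrackBitmap(8)
for sector in (2, 3, 6):
    bitmap.allocate(sector)
print(bitmap.holes(track=0))
```

With sectors 2, 3, and 6 in use on an 8-sector track, the holes are sectors 0 to 1, 4 to 5, and 7. The bit map needs one bit per sector regardless of fragmentation, whereas the free list grows with the number of holes.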




Figure 3 Organization of Tracks and Sectors on a 
Disk of a Hard Disk Drive 


1. Storage Standards 


In this section, we will study storage standards such
as CD-ROM, DVD, and Blu-Ray. These are commonly used
in computer systems to store or back up data. Much
attention will be paid to their sizes, capacities,
transfer rates, and related interface standards.


1. CD-ROM 


A CD-ROM is a pre-pressed compact disc that
contains data or music information to be read via a
CD-ROM drive, and it is not writable. It comes in
diameters of either 8 cm or 12 cm. Recordable (CD-R)
and rewritable (CD-RW) discs allow data to be recorded
by a laser that changes the properties of a dye or
phase transition material, in a process referred to as
“burning.” A CD-ROM sector is composed of 98 frames of
24 bytes each (2,352 bytes in total). A standard
74-minute CD contains 333,000 sectors. Depending on
the error correction scheme, a sector contains 2,048
bytes of PC data (mode 1), 2,336 bytes of PSX/VCD data
(mode 2), or 2,352 bytes of audio. On a 74-minute CD,
one can therefore burn a larger image in raw mode, up
to 747 MB (2,352 x 333,000 bytes). An image extracted
in raw mode is always a multiple of 2,352 bytes.
Capacities of compact discs are listed in Table 1.
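The capacity figures follow directly from the sector count and the payload per sector. The short Python sketch below reproduces them; the names are illustrative, and it assumes that "MB" in the text means 2^20 bytes, which is the interpretation that yields the 747 MB raw figure.

```python
SECTORS_74_MIN = 333_000          # sectors on a standard 74-minute CD
PAYLOAD_BYTES = {                 # user bytes per 2,352-byte sector
    "mode 1 (PC data)": 2_048,
    "mode 2 (PSX/VCD)": 2_336,
    "audio / raw": 2_352,
}
MIB = 1024 * 1024                 # assumption: "MB" means 2**20 bytes


def capacity_mib(sectors, payload):
    """Capacity in units of 2**20 bytes for a given per-sector payload."""
    return sectors * payload / MIB


for mode, payload in PAYLOAD_BYTES.items():
    print(f"{mode}: {capacity_mib(SECTORS_74_MIN, payload):.1f} MB")
```

The mode 1 result is about 650 MB, the familiar nominal capacity of a 74-minute disc, while raw mode gives about 747 MB because no bytes in the sector are reserved for extra error correction.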


Table 1 Capacities of Common Compact Discs


Type    Sectors    Data Size    Audio Time (min)


DVD


DVD, whose development by Philips, Sony, Toshiba, and
Panasonic began in the early 1990s, is a higher
capacity optical storage format than CD. A DVD-ROM is
pre-recorded using a molding machine that physically
stamps data onto the disc. Like CD-ROM, DVD-ROM is
read-only and not writable. Its diameter is either 8
cm or 12 cm. Write-once DVDs include DVD-R and DVD+R.
Rewritable DVDs include DVD-RW, DVD+RW, and DVD-RAM.
DVDs emerged mainly to store large movie files. Prior
to DVDs, Video CD (VCD), introduced in 1993, was the
first format for digitally encoded films stored on
120 mm optical discs. Though VCD was low cost and
successful in the market for a couple of years, it had
no means of preventing unauthorized copies. In
addition, a typical movie required 2 discs. DVD, with
a copy protection mechanism and a higher capacity, was
developed in 1995. DVDs are manufactured as single
sided (SS) or double sided (DS), and as single layer
(SL) or dual layer (DL). Table 2 lists capacities of
DVDs.


Table 2 Capacities of DVDs


The recording times of DVDs depend on drive speed
and data transfer time. With an 18x DVD writer, it
takes about 3 minutes to burn a 4.37 GB DVD disc.
Note that the actual recording time also depends on
the configuration of the computer system, such as the
size of the main memory, the performance of the hard
disk drives, etc.
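The ideal recording time is simply the capacity divided by the sustained data rate. The sketch below illustrates the arithmetic under two assumptions that are not stated in the text: a 1x DVD rate of roughly 1.385 MB/s, and "4.37 GB" meaning 4.37 x 2^30 bytes. The function name is illustrative.

```python
DVD_1X_BYTES_PER_S = 1_385_000   # assumed 1x DVD rate (about 1.385 MB/s)
GIB = 1024 ** 3                  # assumption: "4.37 GB" means 4.37 * 2**30 bytes


def recording_minutes(capacity_bytes, speed_multiplier,
                      base_rate=DVD_1X_BYTES_PER_S):
    """Ideal burn time in minutes, ignoring lead-in and host-side delays."""
    seconds = capacity_bytes / (speed_multiplier * base_rate)
    return seconds / 60


for speed in (1, 4, 8, 18):
    print(f"{speed:>2}x: {recording_minutes(4.37 * GIB, speed):5.1f} min")
```

At 18x this gives a little over 3 minutes, in line with the figure quoted above. Real drives rarely sustain their top speed over the whole disc, so measured times tend to be longer.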


Table 3 Drive Speed, Data Transfer Rate, and
Recording Time for a 4.37 GB Single Layer DVD Disc


Drive Speed    Data Rate (MB/s)    Recording Time


Peripheral Interfaces


Interface standards for storage devices include SCSI
(Small Computer System Interface), IDE (Integrated
Drive Electronics), and SATA (Serial Advanced
Technology Attachment). SCSI, developed in 1978, is
a set of standards for physically connecting
and transferring data between computers and
peripheral devices, especially hard disk drives and
CD-ROM drives. Up to 16 SCSI devices can be connected
on a single bus. It is popular in mainframe
computers but not in desktop or laptop computers.
IDE was developed by Western Digital mainly for
IBM PC/ATs. This standard is also known as ATA (AT
Attachment) because it was designed to connect
directly to the PC/AT via the 16-bit ISA bus.
The first IDE hard disk drive appeared in Compaq
PCs in 1986. SATA is a serial bus used for mass
storage devices such as hard disk drives and optical
drives (DVD and Blu-Ray), replacing the IDE
standard. Most PCs sold after 2011 are
equipped with hard disk drives and DVD drives that use
the SATA interface. The transfer rates of SCSI,
IDE/ATA, and SATA vary across revisions as
follows:


Table 4 Data Transfer Rates in Different Peripheral
Device Standards


CDs use 780 nm near-infrared lasers.


Blu-Ray 


Blu-Ray disc (BD) was developed to replace the DVD
format. Its size is the same as that of a CD or DVD:
all three come in diameters of either 8 cm (Mini-BD)
or 12 cm. A Blu-Ray disc may have a single layer, dual
layers, triple layers, or quadruple layers. Instead of
the 650 nm red laser used in DVD, Blu-Ray employs a
405 nm blue laser (shorter wavelength), which
allows information to be stored and retrieved at a
higher density than DVD [footnote]. The capacities
of Blu-Ray discs range from 7.8 to 128 GB. By and
large, the higher capacity of Blu-Ray allows it to
store motion pictures in higher definition than DVD.
The first Blu-Ray disc was unveiled in 2000 by Sony,
and the first player prototype was introduced in Japan
in 2003. In 2006, Blu-Ray was officially
announced. Statistics show that in 2011 there were
2,500 BD titles in Australia and the United Kingdom,
3,500 in the United States and Canada, and 3,300 in
Japan. Due to the success of Blu-Ray, Toshiba was
forced to abandon its HD DVD format, and produced its
own Blu-Ray player in 2009.


Blu-Ray recording time depends on the data transfer
rate (drive speed). For a single layer disc (25 GB),
the recording time is about 90 minutes at a data
transfer rate of 4.5 MB/s. Table 5 lists the data
rates and recording times for different Blu-Ray drive
speeds. Currently (2012), an internal BD-ROM
drive costs $50 for 12x, and $70 for 14x. With these
drive speeds, it takes about 7 minutes to make a
copy of a single layer 25 GB BD disc.


Table 5 Blu-Ray Data Rates and Recording Times for 
Single Layer (25 GB) Discs 


Drive Speed    Data Rate (MB/s)    Recording Time


High-definition videos may be stored on BD-ROMs
with up to 1920 by 1080 pixel resolution at 29.97
frames per second. BD players are required
to support Dolby Digital (AC3), DTS, and linear
PCM audio. The maximal data transfer rate, the maximal
combined audio and video bitrate, and the maximal
video bitrate for BD films are 54 Mb/s, 48 Mb/s, and
40 Mb/s, respectively, which are much higher than
those of HD DVD movies.


Data Transfer Rates 


One of the performance metrics of storage devices is
the data transfer rate. The higher the data rate, the
shorter the data access time. Data transfer rates for
commonly used storage devices are listed in Table 6.


Table 6 Data Transfer Rates for Removable Optical 
Storage Devices 


[Table 6 residue: columns Type and Transfer Rate; the only legible row is Hard Disk Drive, ~ 300]


Media Capacities 


A comparison of the capacities of different storage
media is listed in Table 7. Although hard disk drives
have the highest capacity of all, they are
expensive and prone to damage due to their
electrical characteristics. In any case, data stored on
hard disk drives should be backed up on
removable media. In this regard, Blu-Ray provides
high capacity at a reasonable cost.


Table 7 Capacity of Storage Media


Memory Hierarchy


Memory or storage devices are used to store data to
be processed by computers. In general, memory
components fall into two groups: those with fast
access that are expensive and low in capacity, and
those with slow access that are cheap and high in
capacity. It would be great to build a computer system
exclusively from the former type of memory component.
However, such a computer would be extremely expensive
and nearly impossible to produce. Due to locality
(both temporal and spatial) of references in a
running program, typically only a small portion of
the code and data needs to be stored in fast memory at
any time. In order to balance performance and cost,
memory components (registers, SRAM, DRAM, hard disk
drives, removable media) are organized in a
hierarchy in which high performance, expensive
components sit at the top whereas less expensive
components form high capacity storage at the base.
Figure 4 illustrates the memory hierarchy in computer
systems. In general, data movement occurs when a
piece of data is required for CPU computation. If the
requested data are not present in a level, they have
to be moved up from lower levels. Eventually, the data
are stored in the highest level, i.e., registers, and
are ready for computation. Most of the components in
the memory hierarchy have been discussed in
earlier sections except cache, which will be studied
in a later section.


Figure 4 Memory Hierarchy for Computer Systems 


Cache 


Cache is a high speed, expensive, volatile memory
component that stores frequently used data for the
CPU. The existence of cache greatly improves a
computer system's performance because it serves as a
buffer between the main memory (RAM) and the
registers. There are only a handful of registers, and
thus they cannot accommodate all frequently used data.
As a result, if there were no cache, the CPU would
have to retrieve the frequently used data from the
main memory, which is much slower than cache. Because
of locality of references, the small cache keeps
frequently used data available to the CPU with very
low latency (almost as fast as registers). The CPU can
then get the data from the cache, and the long main
memory latency is avoided.


Principle of Locality 


The principle of locality involves two dimensions:
temporal and spatial. A running program typically
accesses a small portion of its address space at any
time. A typical program is composed of loops,
sequential statements, decisions, etc. Among the
statements, those that reside in a loop are
executed frequently and iteratively. A recently
executed statement in a loop will likely be executed
again soon. This is temporal locality; the statements
and induction variables in a loop are examples.


Data structures such as arrays that pack data
together to be processed by a program create a
spatial relation among data. A data item in the
neighborhood of a recently accessed datum will
likely be retrieved soon. This is spatial locality.
For example, a bubble sort algorithm working on an
array accesses the data in the array
sequentially. Therefore, the data items in the array
exhibit spatial locality. Additionally, statements in
a program are accessed sequentially when running the
program: the next instruction is fetched after
the current instruction finishes its execution.


Taking Advantage of Locality 


The memory hierarchy is developed according to the
principle of locality. Basically, information is stored
on hard disks. When a program is in execution,
recently accessed or nearby data are copied from
hard disks to main memory (DRAM). The more
recently accessed or nearby data are then copied
from the main memory to the smaller SRAM (cache).
The requested data are then copied from cache to
registers.


Operating Principles 


Data are decomposed into blocks (a.k.a. lines), each
of which consists of multiple words and is the basic
unit of copying. If the requested block is present in
a level of the memory hierarchy, a data hit occurs
(the data request is satisfied). The hit ratio is
calculated as the number of hits divided by the total
number of accesses. A typical hit ratio is above 90%
in all levels. If a requested block is not present in
a level, a data miss occurs (the requested block may
reside in lower levels). The time required to bring
the requested block from lower levels to the current
level is called the miss penalty. It is worth
mentioning that a chain of data misses may happen if
a requested block is not stored in the level that is
immediately below. The miss ratio is calculated as
the total number of misses divided by the total
number of accesses. The relation between the hit ratio
and the miss ratio is as follows:

\[ \text{hit ratio} + \text{miss ratio} = 1 \]


Normally, the system brings the requested block to
the current level after a data miss occurs. Once the
requested block is in place, the operation that
caused the miss resumes. In the cache memory, for
example, if there are 4 blocks with 3 blocks
currently occupied, a newly requested block (not
present), say block 5, will be copied to the empty
slot as depicted in Figure 5.


Figure 5 An Example of Requesting Blocks in a Four
Block Cache


Direct Mapped Cache


The cache is always smaller than the memory. If we
divide cache and memory into blocks, obviously we
can only keep a small number of blocks in cache.
The majority of blocks are still in memory. Each
memory block is copied to a cache block whenever
necessary. Thus, it is inevitable that many memory
blocks share one cache block, meaning that only one of
them can use the cache block at any moment. For
example, if there are 16 memory blocks and 4 cache
blocks, we can assign memory blocks 0, 4, 8, and 12 to
cache block 0, blocks 1, 5, 9, and 13 to cache block
1, and so on. So the cache block number is calculated
by

\[ \text{cache block number} = (\text{memory block number}) \bmod (\text{number of cache blocks}) \]

This mapping is called a direct mapped cache, and is
depicted in Figure 6.


Figure 6 A Direct Mapped Cache with 4 Cache 
blocks for a 16-block Memory 


Since many memory blocks share the same cache
block, how do we know which particular memory
block is stored in a cache block? In addition, since
initially no blocks reside in cache, we need to
differentiate empty cache blocks from non-empty ones.
The high bits of the memory block address may be used
to identify a particular memory block. In Figure 6,
the memory block address contains 4 bits (0000,
0001, 0010, ..., 1111). The two low bits are actually
the cache block address (why?), and the two high bits
are different for all memory blocks that go to the
same cache block. For example, the memory blocks at
addresses 0011, 0111, 1011, and 1111 all go to
cache block 3 (3, 7, 11, and 15 modulo 4 all equal 3).
Their high bits, 00, 01, 10, and 11, are all
different. These high bits, which identify the
memory block, are stored in cache and called tags. In
addition to tags, we may use one extra bit to
indicate if a cache block is empty or not. This bit is
called the valid bit. Initially, the valid bit is 0.
Once a memory block is brought in, the valid bit is
set to one. These are data structures required to
maintain a cache, and are considered overhead.


Address Translation 


Normally, the size of a block is a power of two,
i.e., 2, 4, 8, 16, 32, etc. Consider 16-byte blocks
and 32 cache blocks. Given a memory
address, the lowest 4 bits are the offset of the
byte within the block, the following 5 bits are the
block address, and the rest are the tag. The
number of bits for the offset is the base-2 logarithm
of the block size, and the number of bits for the
block address is the base-2 logarithm of the number
of cache blocks. It turns out that the mod
operation for finding the block address requires no
arithmetic: it simply selects a group of signals from
the address bus.
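The decomposition can be expressed with shifts and masks. The following Python sketch uses the 16-byte block and 32-block cache from the paragraph above; the function name `split_address` and the sample address are illustrative choices, and both sizes are assumed to be powers of two.

```python
def split_address(address, block_size=16, num_blocks=32):
    """Split a byte address into (tag, block address, offset).

    Assumes block_size and num_blocks are powers of two.
    """
    offset_bits = block_size.bit_length() - 1    # log2(16) = 4
    index_bits = num_blocks.bit_length() - 1     # log2(32) = 5
    offset = address & (block_size - 1)                   # lowest 4 bits
    index = (address >> offset_bits) & (num_blocks - 1)   # next 5 bits
    tag = address >> (offset_bits + index_bits)           # remaining high bits
    return tag, index, offset


tag, index, offset = split_address(0x12345)
print(tag, index, offset)
# The bit selection gives the same block address as the mod operation.
print(index == (0x12345 // 16) % 32)
```

For address 0x12345 the result is tag 145, block address 20, and offset 5. The last line confirms that masking the index bits is equivalent to taking the memory block number modulo the number of cache blocks.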


Tag    Block Address    Offset


Figure 7 Address Decomposition for a Direct 
Mapped Cache 


Besides space for tags and status bits (valid and
dirty), there are one comparator and one AND gate
to quickly determine if a requested block is in
cache (hit). Figure 8 illustrates a hardware design
for quick determination of a cache hit. First, the
block address is used to index a potential cache
block. Second, the cache entry (data, tag, and
status bits) is read from the cache. Third, the stored
tag is compared with the tag of the requested address
by a hardware comparator. The output of the comparator
is ANDed with the valid bit. If the output of the AND
gate is high, a cache hit is determined. Otherwise, a
cache miss occurs.
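The hit logic can be mimicked in software. The sketch below is a simplified model rather than a hardware description: it tracks only tags and valid bits (no data), works on memory block numbers instead of byte addresses, and the class name `DirectMappedCache` is hypothetical.

```python
class DirectMappedCache:
    """Tag store of a direct mapped cache, indexed by memory block number."""

    def __init__(self, num_blocks):
        self.num_blocks = num_blocks
        self.valid = [False] * num_blocks   # valid bits start at 0
        self.tags = [None] * num_blocks

    def access(self, memory_block):
        """Return True on a hit; on a miss, load the block and return False."""
        index = memory_block % self.num_blocks   # cache block address
        tag = memory_block // self.num_blocks    # remaining high bits
        # Comparator output ANDed with the valid bit.
        hit = self.valid[index] and self.tags[index] == tag
        if not hit:                              # miss: bring the block in
            self.valid[index] = True
            self.tags[index] = tag
        return hit


cache = DirectMappedCache(4)
results = [cache.access(block) for block in (3, 7, 7, 3, 15, 15)]
print(results)
```

Blocks 3, 7, and 15 all map to cache block 3, so only the immediately repeated references to 7 and 15 hit; the return to block 3 misses because block 7 has replaced it.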


Figure 8 Hardware Circuit for Cache Hit 


Cache Size and Overhead for Direct 
Mapped Cache Organization 


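In general, the cache also includes tags, a valid bit,
a dirty bit, and the actual space for storing data.
Assume that there are a bits for an address, b bytes
for a block, c cache blocks, one valid bit, and one
dirty bit. Each cache block then stores 8b data bits,
a tag of a - log2(c) - log2(b) bits, and 2 status bits.
In direct mapped cache organization, the
total cache size (bytes) is calculated as follows:

\[ \text{size} = \frac{c \, (8b + (a - \log_2 c - \log_2 b) + 2)}{8} \text{ bytes} \]

For example, in a 32-bit address system with 128 cache
blocks and 16-byte blocks, the total cache space is
computed as follows:

\[ \frac{128 \times (8 \times 16 + (32 - 7 - 4) + 2)}{8} = \frac{128 \times 151}{8} = 2416 \text{ bytes} \]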


In the direct mapped cache organization, the space
overhead, i.e., the percentage of the total cache bits
used for data structure maintenance rather than the
actual storage of data, is computed as follows:

\[ \text{overhead} = \frac{(a - \log_2 c - \log_2 b) + 2}{8b + (a - \log_2 c - \log_2 b) + 2} \times 100\% \]

In the previous example with a = 32, b = 16, and
c = 128, the space overhead is 23/151 = 15.23%. For a
48-bit address CPU, the overhead is illustrated in
Figure 9. Obviously, the larger the block size, the
smaller the overhead. However, large blocks may not
always be good, as their content is more likely to be
polluted. In that case, the increased cache miss
penalty outweighs the benefit of the smaller overhead.
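The size and overhead calculation is easy to check numerically. The sketch below assumes that each cache block stores 8b data bits, a tag of a - log2(c) - log2(b) bits, one valid bit, and one dirty bit; this layout is an assumption, chosen because it reproduces the 15.23% overhead quoted for the example. The function name is illustrative.

```python
from math import log2


def direct_mapped_bits(a, b, c):
    """Return (total bits, overhead bits) for a direct mapped cache.

    a: address bits, b: bytes per block, c: number of cache blocks.
    Assumes b and c are powers of two, with one valid and one dirty bit.
    """
    tag_bits = a - int(log2(c)) - int(log2(b))
    overhead_per_block = tag_bits + 2           # tag + valid + dirty
    total = c * (8 * b + overhead_per_block)
    return total, c * overhead_per_block


total, overhead = direct_mapped_bits(a=32, b=16, c=128)
print(total // 8, "bytes")
print(f"{100 * overhead / total:.2f}% overhead")
```

For the 32-bit example this yields 2,416 bytes in total, of which 2,048 bytes hold data. Calling the function with larger values of b shows the overhead percentage shrinking as the block size grows.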


Figure 9 Overhead of Direct Mapped Cache in a 48- 
Bit Address CPU with Different Block Sizes 


Another interesting design tradeoff is what the best
block size will be if a fixed piece of memory is given
for cache. The following formula shows the relation
between the memory size M (in bytes) and the other
parameters:

\[ 8M \ge c \, (8b + (a - \log_2 c - \log_2 b) + 2) \]

By and large, the block size is one power of two less
than M/c because of the overhead. For example, given 64
KB of memory with 48-bit addressing, if the number of
cache blocks is 1024, then the block size should be
32. Because 8M/c = 524288/1024 = 512 bits are available
per block, we need 8b + 40 - log2(b) <= 512, so
log2(b) <= 5. The block size will be 2^5 = 32. If we
want 512 cache blocks, 8M/c = 1024 and
8b + 41 - log2(b) <= 1024, so log2(b) <= 6. The block
size will be 2^6 = 64. We may also use this relation
to find the number of cache blocks if the block size
is fixed.
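The search for the block size can be automated. The following sketch tries successive powers of two and keeps the largest block whose data bits, tag, valid bit, and dirty bit still fit in the per-block share of the memory. The per-block layout (8b data bits plus a tag of a - log2(c) - log2(b) bits plus 2 status bits) is an assumption consistent with the overhead example earlier in this section, and the function name is illustrative.

```python
def largest_block_size(memory_bytes, address_bits, num_blocks):
    """Largest power-of-two block size (bytes) that fits, or None.

    Each block needs 8*b data bits, a tag, one valid bit and one dirty bit.
    Assumes num_blocks is a power of two.
    """
    index_bits = num_blocks.bit_length() - 1      # log2(num_blocks)
    budget = memory_bytes * 8 // num_blocks       # bits available per block
    best, offset_bits = None, 0
    while True:
        block = 1 << offset_bits
        tag_bits = address_bits - index_bits - offset_bits
        needed = 8 * block + tag_bits + 2
        if needed > budget:
            return best                           # previous size was the last fit
        best = block
        offset_bits += 1


print(largest_block_size(64 * 1024, 48, 1024))
print(largest_block_size(64 * 1024, 48, 512))
```

With 64 KB and 48-bit addresses the results are 32 bytes for 1024 blocks and 64 bytes for 512 blocks. In both cases the answer is one power of two below the naive quotient of memory size and block count, because the tag and status bits use part of each block's budget.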


Block Size Consideration 


Larger blocks should reduce miss rates due to spatial
locality. However, in a cache of fixed size, larger
blocks also mean that fewer of them are available.
This increases competition for cache blocks, resulting
in higher miss rates. The other concern about large
blocks is data pollution, i.e., data in a block being
modified. If a cache block has not been modified at
all, the system does not need to copy it back to the
main memory when it is selected for eviction. Larger
blocks have a higher chance of being modified than
small blocks. Thereby, the system may end up with a
longer miss penalty, as the whole evicted block has
to be copied to memory. The larger miss penalty
may override the benefit of reduced miss rates.


Cache Miss 


On cache hits, the CPU proceeds normally. On a cache
miss, the CPU pipeline stalls and the requested block
is fetched from the next level in the memory
hierarchy. In a Harvard architecture with physically
separate storage for instructions and data, after an
instruction cache miss, the instruction is fetched
again. After a data miss, the data access resumes
and completes.


Write Through or Write Back


On a data write hit, if the cache block is modified
but the memory block is left intact, an inconsistency
arises between cache and memory. The
“write-through” strategy also updates the
corresponding memory block. However, doing so
increases the time for write instructions. For
example, if the base CPI (cycles per instruction) is
one, 10% of the instructions are writes, and a memory
access takes 100 cycles, the effective CPI becomes
11, as calculated in the following:

\[ \text{CPI} = 1 + 0.1 \times 100 = 11 \]


One solution for this increased CPI is a write buffer,
which accumulates data waiting to be written to
memory. The CPU continues its execution until the
write buffer is full. At that time, the write
buffer is flushed to the memory. A benefit of
write through is that evicted cache blocks may
simply be dropped without copying them back to the
memory.


Another solution is the “write back” strategy. On a data
write hit, only the cache block is updated. A dirty 
bit is set for the cache block, indicating that the 
cache block has been modified. When a dirty cache 
block is evicted, the whole block is written back to 
the memory. 
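The difference between the two strategies shows up in the number of memory writes. The sketch below is a deliberately tiny model under stated assumptions: it considers repeated write hits to a single cached block that is evicted once at the end, with no write buffer. The function names are illustrative. It also reproduces the effective CPI arithmetic for write through.

```python
def memory_writes(write_hits, policy):
    """Memory-block writes caused by repeated write hits to ONE cached
    block that is evicted once at the end."""
    if policy == "write-through":
        return write_hits                 # every write also updates memory
    if policy == "write-back":
        dirty = write_hits > 0            # dirty bit set by the first write
        return 1 if dirty else 0          # one copy-back at eviction
    raise ValueError(policy)


def effective_cpi(base_cpi, write_fraction, memory_cycles):
    """CPI when every write waits for memory (write-through, no buffer)."""
    return base_cpi + write_fraction * memory_cycles


print(effective_cpi(1, 0.10, 100))
print(memory_writes(50, "write-through"), memory_writes(50, "write-back"))
```

Fifty writes to the same block cost fifty memory updates under write through but a single copy-back under write back, which is why write back pays off when a block is written many times before eviction.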


Set Associative Cache 


One of the problems of direct mapped cache is that
the block address is uniquely determined, i.e., each
memory block goes to one and only one location in
cache. Consider the case in which the CPU references
memory blocks that go to the same cache block. For
example, with 4 cache blocks, the CPU references
memory blocks whose addresses are multiples of 4,
say 0, 8, 0, 8, 0. In this scenario, even though there
are 4 cache blocks, every reference is a cache miss
because each memory block goes to cache block 0 when
loaded and evicts the other. Therefore, we may create
some freedom for each memory block. A set of cache
blocks is created for holding memory blocks that have
the same set address. In the previous example, if we
divide the 4 cache blocks into 2 sets, each containing
two slots for blocks, the access pattern with block
addresses 0, 8, 0, 8, 0 results in fewer
cache misses. Figure 10 illustrates the performance of
a two-way set associative cache for the above access
pattern. The number of ways is the
number of blocks in a set.
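The effect of associativity on this access pattern can be reproduced with a small simulation. The sketch below is a simplified model: it works on block numbers only, and it assumes LRU replacement within a set, which the text does not specify. The function name is illustrative.

```python
def count_misses(blocks, num_cache_blocks, ways):
    """Misses for a block reference string in a set associative cache
    with LRU replacement inside each set (ways=1 is direct mapped)."""
    num_sets = num_cache_blocks // ways
    sets = [[] for _ in range(num_sets)]   # each list: LRU first, MRU last
    misses = 0
    for block in blocks:
        lines = sets[block % num_sets]     # set address
        if block in lines:
            lines.remove(block)            # hit: recency refreshed below
        else:
            misses += 1
            if len(lines) == ways:
                lines.pop(0)               # evict the least recently used
        lines.append(block)
    return misses


pattern = [0, 8, 0, 8, 0]
print(count_misses(pattern, 4, ways=1))   # direct mapped
print(count_misses(pattern, 4, ways=2))   # two-way, two sets
```

The direct mapped cache misses on all 5 references, while the two-way cache misses only on the first reference to each block, for 2 misses in total.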


Figure 10 Cache Miss Analysis for a Two-Way Set 
Associative Cache 


The set address is calculated by the block address 
modulo the number of sets. So the address 
decomposition becomes the following: 


Tag    Set Address    Offset


Figure 11 Address Decomposition in an n-Way Set
Associative Cache


The number of bits in the offset field is equal to
log2(block size in bytes). The number of bits in the
set address field is calculated by log2(number of
sets). The rest of the address bits
belong to the tag. Since a memory block may go
anywhere in a set, the hardware is more complex: all
tags in a set have to be compared against a given
tag in parallel to avoid performance loss. Figure 12
depicts a hardware design for a two-way set
associative cache with 8 sets. Note that the control
inputs for the multiplexer come from each way. An
encoder may be needed to convert these signals into a
form suitable for the multiplexer. In this case, there
is a 2x1 encoder (omitted in the diagram) for the
2-way set associative cache.


Figure 12 A Hardware Design for Two-Way Set 
Associative Cache with 8 Sets 


We may develop 2-way, 4-way, 8-way, or n-way set
associative caches following the above design to
provide more slots where a memory block can go.
Increased associativity decreases the miss rate,
but the hardware cost can be extremely high.
Simulation results show that the performance
saturates beyond 2-way. In practice, 4-way is a
common choice.


Fully Associative Cache 


If we push this to the limit, there is only one set
and a memory block may go anywhere. This is called
fully associative cache organization. Consider that
there are 16 cache blocks. We may have eight 2-way
sets, four 4-way sets, two 8-way sets, or one 16-way
set. In this case, the 16-way set associative cache is
fully associative. Since a memory block may go
anywhere in the set, the fully associative cache
relies on expensive hardware to compare all the tags
in parallel. Due to its high cost, and because 4-way
in practice offers the best cost/performance, the
fully associative cache is rarely realized
commercially.


Replacement Policies 


In any cache organization (direct mapped, set
associative, or fully associative), when there is no
room for a newly requested memory block, a cache
block has to be evicted according to some
replacement policy. In this section, we introduce
four replacement policies: Random, First-in First-out
(FIFO), Least Recently Used (LRU), and Most
Recently Used (MRU). The random replacement
policy selects a victim block randomly. It is
normally implemented with a simple hardware circuit
that generates a random number as an index to the
victim block.


The FIFO policy keeps blocks based on their ages.
The older blocks, those loaded earlier, are evicted
first. The policy is designed according to the way a
sequential program is executed. Consider a loop
structure that exactly fits in the cache. The blocks
evicted should be those loaded before entering the
loop, because they are older.


The LRU policy keeps track of the reference records
of each block in the cache. Basically, if a block is
referenced, it is the most recently used, and will not
be replaced next. A typical implementation is a list
that records the reference pattern. If a block is
referenced, it is brought to the head of the list. The
block at the tail of the list is the least recently
used block, and is evicted when a new block needs
room. Each block stays in the list at least as long
as the length of the list, i.e., a block stays in the
list for “length” time units if it is only referenced
once.


The MRU policy states that the evicted block is
the block that has most recently been referenced. MRU
is the opposite of LRU. In cases where a loop is
followed by some sequential statements and the
loop will be executed again, MRU performs
better. The implementation of MRU is identical
to that of LRU except that the evicted block is chosen
from the head of the list.
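The four policies can be compared with a short simulation. The sketch below assumes a fully associative cache and counts misses only; the function name `simulate` and the test pattern are illustrative. It keeps the blocks in an `OrderedDict` whose last entry is the most recently used, so the "head of the list" in the description above corresponds to the end of the dictionary here.

```python
from collections import OrderedDict
import random


def simulate(references, capacity, policy, seed=0):
    """Count misses for FIFO, LRU, MRU, or random replacement in a
    fully associative cache holding `capacity` blocks."""
    rng = random.Random(seed)
    cache = OrderedDict()            # oldest / least recent entry comes first
    misses = 0
    for block in references:
        if block in cache:
            if policy in ("LRU", "MRU"):
                cache.move_to_end(block)       # most recent goes last
            continue
        misses += 1
        if len(cache) == capacity:
            if policy in ("FIFO", "LRU"):
                cache.popitem(last=False)      # oldest / least recently used
            elif policy == "MRU":
                cache.popitem(last=True)       # most recently used
            else:                              # "random"
                del cache[rng.choice(list(cache))]
        cache[block] = True
    return misses


loop = [0, 1, 2, 3] * 3              # a loop one block larger than the cache
for policy in ("FIFO", "LRU", "MRU"):
    print(policy, simulate(loop, capacity=3, policy=policy))
```

On this cyclic pattern FIFO and LRU miss on all 12 references, because each block is evicted just before it is needed again, whereas MRU misses only 6 times. This illustrates the case mentioned above in which a repeated loop favors MRU.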


Multilevel Cache 


Larger caches have a better hit rate but longer
latency. To remedy this issue, many CPU designs apply
multiple levels of caches, in which larger, slower
caches back up smaller, faster caches. A multilevel
cache operates by first checking the first level (L1).
If the requested block is found in L1, the CPU
continues execution. If an L1 cache miss occurs, the
second level (L2) cache is checked. If found, the
requested block is copied from L2 to L1. This process
continues until, in the worst case, the requested
block is fetched from the external memory.


Multilevel caches may be implemented off-chip or
on-die within the CPU. For example, the IBM Power4 has
a 96 KB L1 cache (64 KB for instruction cache, and 32
KB for data cache), a 1.41 MB L2 cache, and a 32 MB
off-chip cache per processor. The Intel Itanium 2 has
a 32 KB L1 cache (16 KB for instruction cache, and
16 KB for data cache), a 256 KB L2 cache, and up to 24
MB of on-die L3 cache. In the Intel Xeon MP, two
processors share a 16 MB on-die L3 cache. The AMD
Phenom II has a 6 MB on-die L3 cache. The Intel Core
i7 has an 8 MB on-die L3 cache shared among all
cores.


Cache Coherency 


In CPUs with multiple levels of caches, the
higher-level caches are typically shared among cores,
each of which has a local cache. Therefore, a block in
a higher level cache may have several copies sitting
in the local caches of the cores. This replication
reduces both the latency of access to higher level
caches and contention for reading a shared data item.
Updates to a local copy have to be applied to the
other copies. Otherwise, computations that depend on
the updates may fail. Cache coherency ensures that
modified operands are propagated to the other local
copies in a timely manner.


A cache coherency protocol maintains
consistency among all caches in a
system with shared caches. MESI (a.k.a. the Illinois
protocol) is a widely used cache coherence protocol
that supports write-back caches. It has been
extensively used in Intel processors such as the
Pentium. Each cache block is additionally marked with
2 status bits that encode the following states.


Modified: The cache block is present only in the 
current cache, and is dirty. Some of the values 
stored in the cache block have been modified. So, 
the cache block is required to be written back to 
main memory at some time in the future, before 
permitting any other read of the (no longer valid) 
main memory state. The write-back changes the 
block to the Exclusive state. 


Exclusive: The cache block is present only in the
current cache, but is clean. None of the values in the
cache block have been modified. It is identical to
the copy in the main memory. It may be changed to
the Shared state at any time, in response to a read
request. Alternatively, it may be changed to the
Modified state when writing to it.


Shared: Indicates that this cache block may be 
stored in other caches of the machine and is clean. 
The cache block has not been modified and is 
identical to the copy in the main memory. The block 
may be discarded, i.e., changed to the Invalid state, 
at any time. 


Invalid: Indicates that this cache block is invalid. 


The purpose of caches is to minimize main memory
access. A cache may satisfy a read from any state
except Invalid. An Invalid block must be fetched to
satisfy a read. A write may only be performed if the
cache block is in the Modified or the Exclusive state.
If it is in the Shared state, all other cached copies
must be invalidated first. This is done by a broadcast
operation called Request for Ownership (RFO). A
non-modified block may be invalidated and discarded
at any time. A Modified block must be written back
first. A cache that holds a Modified block has to
snoop any attempt to read its memory copy. If such an
attempt is found, the cache writes the block back to
memory and sets it to the Shared state. A cache that
holds a Shared block must listen for invalidate or RFO
broadcasts, and discard the block when one arrives. A
cache that holds an Exclusive block must snoop read
operations from other caches, and change the block to
the Shared state if one is found. Since cache
blocks are discarded without broadcasting a
message to other caches, it is possible to have a
Shared cache block that is only used by one
processor.
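The state changes described above can be summarized as a small transition function. The sketch below is a simplified illustration, not a complete MESI implementation: it follows one cache block in one cache, the event names are hypothetical, and bus traffic such as write-backs and RFO broadcasts is noted only in comments.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


def next_state(state, event, others_have_copy=False):
    """Next MESI state of one cache block for one event.

    Events (illustrative names): "local_read", "local_write",
    "snoop_read" (another cache reads the block), and
    "snoop_rfo" (another cache requests ownership or invalidates).
    """
    if event == "local_read":
        if state is State.INVALID:           # the block must be fetched
            return State.SHARED if others_have_copy else State.EXCLUSIVE
        return state                         # M, E, and S satisfy the read
    if event == "local_write":
        # From S or I, an RFO broadcast invalidates other copies first.
        return State.MODIFIED
    if event == "snoop_read":
        if state in (State.MODIFIED, State.EXCLUSIVE):
            return State.SHARED              # M is written back first
        return state
    if event == "snoop_rfo":
        return State.INVALID                 # M is written back first
    raise ValueError(event)


state = State.INVALID
history = []
for event in ("local_read", "local_write", "snoop_read", "snoop_rfo"):
    state = next_state(state, event)
    history.append(state.name)
    print(event, "->", state.name)
```

Starting from Invalid, the block becomes Exclusive after a local read with no other copies, Modified after a local write, Shared when another cache reads it, and Invalid again when another cache requests ownership.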


Multiprocessing 

This chapter introduces common multiprocessing 
models such as SMP, GPU, SIMD vs. MIMD, 
Amdahl’s Law, and the Flynn Taxonomy. Discuss the 
concept of parallel processing beyond the classical 
von Neumann model. Describe alternative parallel 
architectures such as SIMD and MIMD. Explain the 
concept of interconnection networks and 
characterize different approaches. Discuss the 
special concerns that multiprocessing systems 
present with respect to memory management and 
describe how these are addressed. Describe the 
differences between memory backplane, processor 
memory interconnect, and remote memory via 
networks, their implications for access latency and 
impact on program performance. 


Multiprocessing 


A computer system may have more than
one processor, with each processor running an
instruction at the same time (in parallel). The
term multiprocessing is used to describe such a
system. A multiprocessing system
processes multiple instructions at each time
instant. In contrast, a single-processor
system has only one instruction running at each
time instant. From the program's perspective,
multiprocessing can also mean that a program is
dynamically assigned to several processors working
in tandem. Theoretically, a two-processor system
runs twice as fast as a single-processor
system. In practice, however, there are issues such
as communication cost among processors and
parallelization. Some problems may not be
parallelizable. It is also necessary to synchronize
the execution of the processes according to the
algorithms of the program in question. This cost may
outweigh the speedup from multiprocessing. There are
two types of multiprocessing: symmetric
multiprocessing (SMP) and massively parallel
processing (MPP).


Symmetric Multiprocessing 


SMP is tightly coupled multiprocessing, in which the
processors share a single memory and I/O bus or data
path. An operating system manages task assignment
for all the processors. With the advent of VLSI
technology, several processors may be packed into a
single chip, called a multi-core processor. When the
SMP architecture is applied to a multi-core
processor, each core is treated as a separate
processor. Buses, crossbar switches, or on-chip mesh
networks are used to interconnect the processors in
SMP. Processors interconnected by buses or crossbar
switches may not scale well due to the limited
bandwidth and the high power consumption of the
interconnection. On-chip mesh architectures avoid
these drawbacks and provide nearly linear
scalability to much higher processor counts.


Figure A Typical SMP System 


Hardware Support for SMP 


Since the processors in SMP have to communicate
with each other to accomplish a task, a
multiprocessing protocol has to be implemented. This
protocol dictates how the processors talk to each
other to perform a computation. The Advanced
Programmable Interrupt Controller (APIC) protocol
is provided and implemented by Intel in its
processors such as the Pentium and Pentium Pro.
Intel chipsets supporting multiprocessing include
the 430HX, 440FX, and 450GX/KX. APIC is a
proprietary standard that prevents other CPU
manufacturers such as AMD and Cyrix from building
processors that communicate with Intel’s processors
in the SMP model, even though they may build
x86-compatible processors. In addition to the APIC
protocol, the Open Programmable Interrupt Controller
(OpenPIC) was proposed by AMD and Cyrix as their
multiprocessing protocol. It can support up to 32
processors. However, very few motherboards
actually implement OpenPIC. AMD licensed APIC for
its Athlon and later processors.


Whether the level 2 cache of the processors in an
SMP system is shared or not has a direct impact on
performance. For example, the self-contained level 2
cache of processors such as the Intel Pentium Pro or
Pentium II makes them a better choice for
multiprocessing. If the level 2 cache is built on
the motherboard, it will be shared by each added
processor and performance will be degraded.
Moreover, multi-core processors with a level 3 cache
are good candidates for building an SMP system. At
the time of writing, such processors include the
Intel Core 2 Quad, Core i7 (6 cores), AMD Propus
(four AMD K10 cores), and AMD Thuban (6 AMD K10
cores).


Massively Parallel Processing 


Processors can be loosely coupled to form a
computing platform. A massively parallel processing
(MPP) machine is such a computing platform with
many networked processors, typically more than
200. MPPs share many of the same characteristics as
clusters, but an MPP has a high-speed interconnect
network that allows communication among the
individual processors, each of which may be running
a different operating system. Each processor in an
MPP has its own memory and runs a coordinated task.
The number of processors in MPPs tends to be larger
than that of clusters.


Distributed Computing 


A program may be distributed among and executed on
a set of network-connected computers. A distributed
computer (a.k.a. a distributed memory
multiprocessor) is composed of multiple
autonomous computers that communicate with each
other via a computer network. A program written
for a distributed computing platform has to be
divided into many coordinated tasks, and each of
them is distributed to one or more computers in the
system. The partial results from each computer are
collected and used to build the final result for
the problem in question. Distributed computing is
highly scalable.
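
The divide, distribute, and collect pattern described above can be sketched in a few lines. This is a minimal illustration under a loud assumption: worker threads in one process stand in for the networked computers, whereas a real distributed platform would ship each task over the network (for example, by message passing). The names `partial_sum` and `distributed_sum` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum(chunk):
    """The task given to one 'computer': sum its share of the data."""
    return sum(chunk)


def distributed_sum(data, workers=4):
    """Split data into tasks, run them on workers, and combine the partial results."""
    # Ceiling division so that every element lands in some chunk.
    size = max(1, (len(data) + workers - 1) // workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each chunk is an independent, coordinated task.
        partials = list(pool.map(partial_sum, chunks))
    # Collect the partial results and build the final result.
    return sum(partials)


if __name__ == "__main__":
    print(distributed_sum(list(range(1, 101))))
```

Summation is used here because its partial results combine trivially; problems whose tasks depend on each other need additional coordination between the workers.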


Cluster Computing 


Typically, a cluster is a group of identical
network-connected computers that work together
closely to run an application that solves a problem.
To some extent, this group of computers can be
regarded as a single computer. The interconnection
network in a cluster is normally, but not always, a
fast local area network. The computers in a cluster
need not be identical, but if they are not, load
balancing becomes difficult. One of the advantages
of cluster computing is its low cost compared to a
single supercomputer of comparable speed. For
example, the Beowulf cluster, a typical setting for
cluster computing, is composed of multiple identical
commercial off-the-shelf computers (such as PCs)
connected via a TCP/IP Ethernet network.


Grid Computing 


Grid computing is a collection of computing
resources from multiple administrative domains
brought together to solve a difficult problem. The
grid can be regarded as a distributed system of
loosely coupled computers linked across different
geographical areas. What distinguishes grid
computing from traditional high performance
computing platforms, such as cluster computing, is
that it tends to be more loosely coupled,
heterogeneous, and geographically dispersed. Once a
grid is built, it may run a variety of applications
or a dedicated application. Middleware, a set of
library objects including communication and
synchronization tools, is typically used to
construct and manage a general-purpose grid.


The size of a grid varies. Grids are a form of
distributed computing composed of many
networked, loosely coupled computers working
together to perform a very large task. Therefore,
“distributed” or “grid” computing is a special
type of parallel computing that relies on multiple
computers, each of which has its own CPUs,
storage, power supplies, network interfaces, etc.
They are connected via a conventional network
interface, such as Ethernet. In contrast, the
traditional notion of a supercomputer is built on
many processors tightly connected through a local
high-speed computer bus.


Multiprocessing versus 
Multiprogramming 


Multiprocessing should not be confused with 
multiprogramming, or the interleaved execution of 
two or more programs by a processor. Today, the 
term is rarely used since all but the most specialized 
computer operating systems support 
multiprogramming. Multiprocessing can also be 
confused with multitasking, the management of 
programs and the system services they request as 
tasks that can be interleaved, and with 
multithreading, the management of multiple 
execution paths through the computer or of multiple 
users sharing the same copy of a program. 


Multiprocess versus Multithread 


In order to take advantage of multiprocessing,
application software has to be designed for it.
Normally, a program is decomposed into different
tasks, which are implemented as processes or
threads. A process is a running program created by
the operating system, whereas a thread is a
lightweight process. A process may spawn several
threads, each of which has its own flow of control
and its own stack. The main difference between a
process and a thread is that a process is the basic
execution unit with a complete set of memory
segments such as code, data, heap, and stack,
whereas a thread may have only a stack segment and
shares the other memory segments with its process.
The larger process image has a direct impact on the
cost of context switching, which is why a thread is
called a lightweight process. Multiprocessing is
managed by the operating system, which assigns tasks
to processes or threads and schedules them to be
executed by the processors in the system. Programs
designed for multiprocessing are called
multithreaded applications. Each thread may be
executed independently, working on a small task.
Allowing the OS to schedule several threads to run
simultaneously on the available processors in the
system improves performance. However, if an
application is not or cannot be designed this way,
only one of the multiple processors is active at a
time.
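
The way threads share their process's memory while keeping private stacks can be seen in a short sketch. It is an illustration only, with hypothetical names (`worker`, `shared_results`). Note also that in the standard CPython interpreter threads do not execute Python bytecode on several processors at once, so the sketch demonstrates the memory-sharing model rather than a speedup.

```python
import threading

# Allocated once in the process; every thread sees the same list.
shared_results = []
lock = threading.Lock()


def worker(task_id):
    # A local variable lives on the calling thread's own stack.
    local_square = task_id * task_id
    # Access to shared data must be synchronized between threads.
    with lock:
        shared_results.append(local_square)


# One process spawns several threads, each working on a small task.
threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(shared_results))
```

Separate processes would each get their own copy of `shared_results`, so they would have to exchange the partial results explicitly, which is part of why processes are the heavier unit.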


Multitasking 


The term multitasking refers to running multiple
applications at a time. Most contemporary operating
systems, such as Unix and Windows, are multitasking.
Although the operating system seems to run multiple
programs simultaneously, there may be only one
processor. The illusion is created by a time-sharing
technique, which repeatedly runs each program for a
small time tick in turn. Such a system is not
multiprocessing because there is only one processor.
It is perfectly fine to run a multitasking system on
top of a multiprocessing system.
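
The time-sharing technique described above can be simulated with a simple round-robin queue. This is a sketch under stated assumptions: the function name `round_robin`, the tick counts, and the fixed time slice are all hypothetical, and a real scheduler also deals with priorities, I/O waits, and interrupts.

```python
from collections import deque


def round_robin(programs, time_slice):
    """Simulate one processor sharing its time among several programs.

    programs maps a program name to the number of ticks it still needs.
    Returns the order in which the processor ran them, as (name, ticks) pairs.
    """
    queue = deque(programs.items())
    timeline = []
    while queue:
        name, remaining = queue.popleft()
        # Run the program for one time slice, or less if it finishes early.
        run = min(time_slice, remaining)
        timeline.append((name, run))
        if remaining > run:
            # Not finished: go to the back of the queue and wait for the next turn.
            queue.append((name, remaining - run))
    return timeline


if __name__ == "__main__":
    for name, ticks in round_robin({"editor": 3, "compiler": 5}, time_slice=2):
        print(name, ticks)
```

Only one program runs at any tick, yet every program makes steady progress, which is what creates the impression of simultaneous execution.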


Asymmetric versus Symmetric 


The operating system may distribute tasks among
the available processors in a multiprocessing system
either asymmetrically or symmetrically. Asymmetric
multiprocessing reserves some processors for system
tasks only and the others for user applications
only. This inflexible design may degrade performance
when there are system tasks waiting but no system
processor available to run them, or vice versa.
Symmetric multiprocessing, on the other hand, allows
a flexible allocation of tasks to processors. Either
system or user tasks may be scheduled on any
available processor. Therefore, better performance
may be attained.


Cloud Computing 


Cloud computing provides a virtual platform that
integrates applications, data management, and
storage. The end-users of the virtual platform do not
know where the servers are physically located, nor
do they know the system configuration. Because most
of the computing is done in the cloud, the end-users
only need a very thin client, such as a netbook or
handheld device. Cloud computing is an IT service
model delivered over the Internet with dynamic
scalability and virtual resources. It is designed
around a client/server paradigm, which allows users
to access the service via a regular web browser.
This computing model shifts the burden of
software/hardware management, upgrades, backups, and
administration from end-users to the service
provider. The net result is a lower cost to the
users with a better cost/performance ratio.


Applications in cloud computing are delivered by
their service providers via the Internet. Business
software and data are stored where the service
provider’s servers are located. Any device able to
run a browser can access the service without
limitation of time and location. A cloud computing
service may be associated with a data center that
stores business data and provides data access via
web-based technologies such as AJAX. It is also
possible to deliver legacy applications using a
screen-sharing technology.


The idea of cloud computing is to minimize
infrastructure cost and maximize service sharing.
An obvious gain is that an enterprise can largely
eliminate the cost of maintaining its own
infrastructure, such as computers, servers,
software, and their administration. Business
applications running in the cloud are normally
faster to deploy and easier to maintain. This allows
IT to respond quickly to unpredictable business
demand.


Amdahl’s Law 


It is not always true that a system’s performance
can be improved by adding more and more
processors. Speedup is the term used to gauge the
improvement over the old system in terms of
execution time.


\[ \text{speedup} = \frac{\text{time before improvement}}{\text{time after improvement}} \]


The theoretical upper bound of the speedup from
adding one processor to a one-processor system is 2.
That assumes the extra processor takes exactly one
half of the load. In practice, however, the workload
may not be evenly divided, and sometimes the load
cannot be divided at all. In such cases, the speedup
is much less than the theoretical upper bound.


Gene Amdahl proposed Amdahl’s law, also known as
Amdahl’s argument, which is used to find the maximum
expected improvement to an overall system when only
part of the system is improved. It is often used in
parallel computing to predict the theoretical upper
bound of speedup when using multiple processors.


If a program consists of a sequential part and a
parallel part, the speedup from running it on
multiple processors is limited by the time needed
for the sequential part of the program. The
rationale is that the parallel part may be executed
on X processors, which reduces its execution time to
1/X of its single-processor time. The total
execution time, however, is the sum of this reduced
parallel time and Y, where Y is the execution time
of the sequential part. The parallel time approaches
zero if unlimited processors are used, but the total
execution time will still be dominated by the
sequential part, which is Y.


For example, if a program requires 100 hours
running on a single processor core and has a
sequential part that needs 5 hours, then no matter
how many processors are applied to the parallel part
(95 hours), its minimum execution time cannot be
less than the sequential 5 hours. The speedup is
calculated as follows:


\[ \text{speedup} = \frac{\text{time before improvement}}{\text{time after improvement}} = \frac{100}{5 + \frac{95}{X}} < 20 \]


The bound of 20 means that this program can run at
most 20 times faster on a parallel machine, no
matter how many processors are used. Sometimes the
number is much smaller, and it is arguable whether
the program should be parallelized at all!


If both the numerator and the denominator of the
speedup for this example are divided by 100, the
following equation is derived:


\[ \text{speedup} = \frac{\text{time before improvement}}{\text{time after improvement}} = \frac{100/100}{\frac{5}{100} + \frac{95/100}{X}} \]


Let P be the parallel portion of the program (95/100
in this example). The speedup becomes


\[ \text{speedup} = \frac{\text{time before improvement}}{\text{time after improvement}} = \frac{1}{(1 - P) + \frac{P}{X}} \]


This is Amdahl’s law, which shows how an
improvement that affects only the proportion P of a
program changes the overall speedup. X is the number
of processors used for the parallel part. In
general, X can be the speedup of the improved part.
For example, if an improvement affects a part that
accounts for 20% of the total computation, P will be
0.2. If the improvement makes that part 2 times
faster, X will be 2. Amdahl’s law states that the
overall speedup from applying the improvement will
be:


\[ \text{speedup} = \frac{1}{(1 - P) + \frac{P}{X}} = \frac{1}{(1 - 0.2) + \frac{0.2}{2}} = \frac{1}{0.9} = \frac{10}{9} \approx 1.11 \]
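
Amdahl's law is easy to explore numerically. The sketch below assumes the usual form of the law, speedup = 1 / ((1 - P) + P / X); the function name `amdahl_speedup` and the chosen processor counts are illustrative only.

```python
def amdahl_speedup(p, x):
    """Overall speedup when a fraction p of the work is made x times faster."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if x <= 0:
        raise ValueError("x must be positive")
    # (1 - p) is the untouched sequential share; p / x is the improved share.
    return 1.0 / ((1.0 - p) + p / x)


if __name__ == "__main__":
    # The 100-hour program from the text: 95 hours parallel, 5 hours sequential.
    for processors in (1, 2, 10, 100, 1000):
        print(processors, round(amdahl_speedup(0.95, processors), 2))
    # The second example from the text: P = 0.2, X = 2.
    print(round(amdahl_speedup(0.2, 2), 2))
```

Running it for increasing processor counts shows the speedup of the 100-hour program creeping toward, but never reaching, the bound of 20 set by its 5 sequential hours.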


Many-core Processing - Graphics 
Processing Unit 


At present, nearly all microprocessor manufacturers 
adopt chip-level multiprocessor (CMP) as a way to 
design and manufacture current and next generation 
processors. With CMP, multiple processor cores are 
contained in a single chip. Currently, these 
multicore processors have already gained 
widespread adoption in various computing 
platforms including game consoles, hand-held 
devices, high-performance computers, network 
devices, and mission-critical embedded systems. The 
use of multicore processors in these systems may 
result in several benefits. First, the use of multicore 
processors can lower energy consumption because 
the computation efforts may be distributed to 
multiple cores that are running at lower clock speed 
than that of the monolithic counterpart. Lower clock 
speed can significantly reduce energy consumption 
of these systems. Second, the use of multicore 
processors provides an opportunity to exploit 
thread-level parallelism. When multithreaded 
programs are executed on these processors, multiple 
threads may concurrently run on multiple cores. 
Shared data can also be accessed efficiently using 
heap memory. 


A graphics processing unit (GPU) is a specialized
processor, consisting of thousands of processing
cores in a chip, designed to rapidly manipulate and
alter video memory that contains information for
displaying pixels on computer monitors. GPUs are
used in virtually any system with a display device,
including embedded systems, mobile phones,
personal computers, workstations, and game
consoles. Recent developments in GPUs have created a
platform for general high performance computing
because of their efficiency at manipulating large
blocks of data in parallel. Programming GPUs for
general-purpose computing was very hard before
Nvidia developed the Compute Unified Device
Architecture (CUDA). With CUDA, a programmer can
easily create millions of threads running on
thousands of cores in a GPU.


Flynn's Taxonomy 


Flynn’s taxonomy, proposed by Michael J. Flynn, is
one of the earliest classification systems for
parallel and sequential computers and programs.
Under Flynn’s taxonomy, programs and computers are
classified by whether they operate using a single
set or multiple sets of instructions, and whether
those instructions use a single set or multiple sets
of data. The following table lists Flynn’s
taxonomy.


Table Flynn's Taxonomy 


                 Single Instruction    Multiple Instruction
Single Data      SISD                  MISD
Multiple Data    SIMD                  MIMD


The single-instruction-single-data (SISD) 
classification is equivalent to an entirely sequential 
program. The single-instruction-multiple-data 
(SIMD) classification is analogous to doing the same 
operation repeatedly over a large data set. Video 
data processing belongs to this category as a huge 
amount of video data is typically processed by an 
instruction, which may, e.g., change its intensity. 
Multiple-instruction-single-data (MISD) is a rarely
used classification. For example, systolic arrays
process a single data stream with multiple
instructions, which may reduce the time complexity
of an array multiplication from cubic to linear.
Multiple-instruction-multiple-data (MIMD) programs
are by far the most common type of parallel
programs.


Owing to its simplicity and understandability,
Flynn’s taxonomy has been widely used to classify
parallel computers, though some machines are hybrids
of these categories.


Hardware Security 

Hardware security as a discipline originated out of
cryptographic engineering and involves hardware
design, access control, secure multi-party
computation, secure key storage, ensuring code
authenticity, and measures to ensure that the supply
chain that built the product is secure, among other
things. A hardware security module (HSM) is a
physical computing device that safeguards and
manages digital keys for strong authentication and
provides cryptoprocessing. These modules
traditionally come in the form of a plug-in card or
an external device that attaches directly to a
computer or network server. The study of hardware
security starts from understanding topics in
hardware/firmware security such as roots of trust,
firmware worms, BIOS/UEFI, and Chipsec. This
chapter gives an overview of hardware security.


Overview 


This chapter aims to give an overview of hardware 
security issues with Information Assurance and 
Security (IAS) learning materials from the systems 
perspective, including hardware, application 
programming interface, and operating systems. 
According to the guidelines on hardware-rooted
security in mobile devices [1], mobile devices are
required to implement the following fundamental
security primitives: roots of trust, an application
programming interface (API) to the platform, and a
policy enforcement engine. We believe it is
important to verify the integrity of these
primitives with hands-on experiments. Another focus
in this module is low-level programming (assembly
programming in x86, ARM, or MIPS) and software
reverse engineering, which are mandatory
requirements for designation as a center of academic
excellence in cyber operations by the National
Security Agency [2]. We will study hardware security
threats in this module along with their
countermeasures. We will demonstrate a tool, Intel’s
Chipsec, that checks a PC BIOS for known
vulnerabilities. Tools such as OllyDbg [3] and IDA
Pro [4] may be used to inspect memory contents.


Commonly Asked Terms 


First of all, we will study firmware worms such as
CIH, which attacks PCs, and Thunderstrike 2, which
may attack Macs. Let’s first examine some common
terms in this subject.


What is firmware? 


Firmware is a piece of software stored in read-only
memory (ROM) or flash memory that comes with the
hardware. For example, each time you run an
MSP430 program in IAR Embedded Workbench with
the FET debugger selected, your program is actually
transferred to the flash memory in the MSP430 chip.
The program therefore stays there even after you
turn off the power to the hardware. When a computer
starts, its firmware is executed first, followed by
loading an operating system such as Windows or
Mac OS. Firmware also exists in peripherals such as
hard drives, network cards, USB memory sticks,
MSP430 LaunchPads, etc. In a PC, the firmware is
called the BIOS, which stands for basic input/output
system.


What is CIH? 


CIH is a computer virus developed by a Taiwanese
college student in 1998. This virus erases the first
megabyte of a hard drive and the PC BIOS firmware.
It causes machines to hang or to show the blue
screen of death. Zeroing out the first megabyte of a
hard drive deletes the partition tables and the
master boot record (MBR), which is why CIH makes
computers unbootable.


How does CIH spread? 


It hides itself in a Portable Executable (PE) file
under Windows 95, 98, and ME. It does not spread
via Windows NT-based operating systems such as
Windows XP, 7, 8, and 10. The virus code is about 1K
in size; it is chopped into small slivers and
inserted into unused space at the tail of a PE
header. As a result, the infected files do not grow
in size at all.


What is the Thunderstrike 2 worm?


Thunderstrike 2 is a firmware worm created by Xeno
Kovah et al. to prove that Macs may be attacked via
an Apple Thunderbolt Ethernet adapter. The worm
hides in the option ROM on the Thunderbolt Ethernet
adapter, which is loaded and infects the Mac’s
firmware when the adapter is connected.


How does Thunderstrike 2 spread?


An attacker could compromise the boot firmware on
MacBooks via a phishing email or a malicious web
site. A compromised MacBook spreads the worm by
looking for any connected devices, such as Apple
Thunderbolt Ethernet adapters, that have an option
ROM and infecting them. When the infected devices
are plugged into other computers, those computers
load the option ROM, which triggers flashing their
boot firmware with the worm.


Why is a firmware worm difficult to detect and
remove?


Most anti-virus software does not have the
privilege to scan the firmware, simply because its
own operation relies on the firmware. Moreover,
infected firmware may disguise itself by responding
normally to any requests from upper-level
applications. This makes it difficult to detect.
Also, the firmware is basically part of the
hardware. Unless a clean firmware image is
explicitly flashed, re-installing the OS will not
remove a worm sitting in the firmware.


Why can firmware worms attack different
computers including Macs, Dell, Lenovo,
Samsung, and HP?


Motherboard manufacturers typically create their
firmware from the same reference implementations,
such as EFI, UEFI, BIOS, AMI, DTK, Award, Lenovo,
and Phoenix. If someone finds a bug in one, there is
a high chance that the vulnerability exists in the
others.


How to find out BIOS information on my 
computer? 


The first way is to restart your computer. When
the initial load (also called POST) screen is
displayed, the BIOS type and version are also
displayed. If the load screen is displayed for only
a few seconds, you can try pressing the Pause/Break
key on your keyboard to pause the loading process.
Pausing the screen should make it easier to find and
read the BIOS information. When you're ready to
resume the boot process, press Pause/Break again. If
you want to change BIOS settings, pressing the “DEL”
or “Enter” key during startup will typically enter
the BIOS setup. Below is my BIOS screenshot.


Second, the BIOS information is also shown in the
Windows System Information tool. To open this tool,
click START, Programs, Accessories, System Tools,
and then System Information. The System Information
window displays information about your computer,
including the type and version of your BIOS, under
the System Summary section. As can be seen in the
picture below, this computer has a BIOS Version/Date
of LENOVO 9QKT31AUS, 10/26/2011.
[Screenshot: Windows System Information, System Summary]

Item                               Value
OS Name                            Microsoft Windows 7 Professional
Version                            6.1.7601 Service Pack 1 Build 7601
OS Manufacturer                    Microsoft Corporation
System Manufacturer                LENOVO
System Model                       3156A47
System Type                        x64-based PC
Processor                          Intel(R) Pentium(R) CPU G620 @ 2.60GHz, 2600 Mhz, 2 Core(s), 2 Logical Proc...
BIOS Version/Date                  LENOVO 9QKT31AUS, 10/26/2011
Windows Directory                  C:\Windows
System Directory                   C:\Windows\system32
Locale                             United States
Time Zone                          Eastern Standard Time
Installed Physical Memory (RAM)    4.00 GB
Total Physical Memory              3.85 GB
Available Physical Memory          2.01 GB
Total Virtual Memory               7.70 GB
Available Virtual Memory           4.54 GB
Page File Space                    3.85 GB
Page File                          C:\pagefile.sys


Lastly, you can also find BIOS information in the
Windows System Registry. While in the registry,
keep in mind that improperly changing a setting can
affect how Windows operates, so be careful if you
choose to use this option to view your BIOS
information.


To access the System Registry, click START, type
regedit in the Run or Search box, and press Enter.
In the Registry Editor, navigate to the following
registry key:


HKEY_LOCAL_MACHINE\HARDWARE\DESCRIPTION\System


Find the values SystemBiosDate and
SystemBiosVersion to see the BIOS date and version
for your motherboard. As can be seen in the picture
below, the BIOS date and version are shown in these
two values (highlighted in blue).


[Screenshot: Registry Editor at Computer\HKEY_LOCAL_MACHINE\HARDWARE\DESCRIPTION\System, with the SystemBiosDate and SystemBiosVersion values highlighted]


What is the countermeasure? 


Do not buy devices from unknown sources or plug
unknown devices into your computers. Infected
firmware has to be re-flashed to remove the worm.
This may require special hardware/software tools
such as a JTAG flash programmer.


NIST (the National Institute of Standards and
Technology) has published two firmware protection
guidance documents: BIOS Protection Guidelines and
BIOS Integrity Measurement Guidelines.


Install the Windows Driver Kit Version 7.1.0 and
build the Chipsec Windows driver.


[Screenshot: Microsoft Download Center confirmation page for Windows Driver Kit Version 7.1.0, with the Windows disc image burner dialog writing the downloaded disc image to a recordable disc]
Running Chipsec:


C:\dan\projects\PLab\SystemFundamentals\chipsec-1.1.0\chipsec-1.1.0\source\tool> python chipsec_main.py
[helper] OS: Windows 7 6.1.7601
[helper] Using 'helper/win/win7_amd64' path for driver
WARNING: G[S]etFirmwareEnvironmentVariableExW function doesn't seem to exist
################################################################
##                                                            ##
##  CHIPSEC: Platform Hardware Security Assessment Framework  ##
##                                                            ##
################################################################
Version 1.1.0

WARNING: Chipsec should only be used on test systems!
WARNING: It should not be installed/deployed on production end-user systems.
WARNING: See WARNING.txt

OS      : Windows 7 6.1.7601 AMD64
Platform: Desktop 2nd Generation Core Processor (Sandy Bridge CPU / Cougar Point PCH)
          VID: 8086
          DID: 0100
CHIPSEC : 1.1.0


[*] loading common modules from ".\chipsec\modules\common" ..
[+] loaded chipsec.modules.common.bios_kbrd_buffer
[+] loaded chipsec.modules.common.bios_ts
[+] loaded chipsec.modules.common.bios_wp
[+] loaded chipsec.modules.common.smm
[+] loaded chipsec.modules.common.smrr
[+] loaded chipsec.modules.common.spi_lock
[+] loaded chipsec.modules.common.secureboot.keys
[+] loaded chipsec.modules.common.secureboot.variables
[*] loading platform specific modules from ".\chipsec\modules\snb" ..
[*] loading modules from ".\chipsec\modules" ..
[+] loaded chipsec.modules.module_template
[*] running loaded modules ..



[+] imported chipsec.modules.common.bios_kbrd_buffer
[*] Module path: C:\dan\projects\PLab\SystemFundamentals\chipsec-1.1.0\chipsec-1.1.0\source\tool\chipsec\modules\common\bios_kbrd_buffer.py
[*] Keyboard buffer head pointer = 0x32 (at 0x41A), tail pointer = 0x32 (at 0x41C)
[*] Keyboard buffer contents (at 0x41E):
20 00 20 00 20 00 20 00 20 00 20 00 20 00 20 00 |
20 00 20 00 20 00 20 00 20 00 20 00 20 00 20 00 |
[-] Keyboard buffer tail points inside the buffer (= 0x32)
    It may potentially expose lengths of pre-boot passwords. Was your password 11 characters long?
[*] Checking contents of the keyboard buffer..
[+] PASSED: Keyboard buffer looks empty. Pre-boot passwords don't seem to be exposed

[+] imported chipsec.modules.common.bios_ts
[*] Module path: C:\dan\projects\PLab\SystemFundamentals\chipsec-1.1.0\chipsec-1.1.0\source\tool\chipsec\modules\common\bios_ts.py
[*] RCBA General Config base: 0xFED1F400
[*] GCS (General Control and Status) register = 0x00000C01
    [10] BBS  (BIOS Boot Straps)         = 0x3
    [00] BILD (BIOS Interface Lock-Down) = 1
[*] BUC (Backed Up Control) register = 0x00000000
    [00] TS   (Top Swap)                 = 0
[*] BC (BIOS Control) register = 0x00
    [04] TSS  (Top Swap Status)          = 0
[*] BIOS Top Swap mode is disabled
[+] PASSED: BIOS Interface is locked (including Top Swap Mode)


[+] imported chipsec.modules.common.bios_wp 


[*] Module path: C:\dan\projects\PLab 
\SystemFundamentals\chipsec-1.1.0\chipsec-1 


.1.0\source\tool\chipsec\modules\common 
\bios_wp.py 


[*] BIOS Control (BDF 0:31:0 + 0xDC) = 0x00
[05] SMM_BWP = 0 (SMM BIOS Write Protection)
[04] TSS = 0 (Top Swap Status)
[01] BLE = 0 (BIOS Lock Enable)
[00] BIOSWE = 0 (BIOS Write Enable)
[-] BIOS region write protection is disabled!
[*] BIOS Region: Base = 0x00180000, Limit = 0x003FFFFF


SPI Protected Ranges 


PRx (offset) | Value    | Base     | Limit    | WP? | RP?
PR0 (74)     | 00000000 | 00000000 | 00000000 | 0   | 0
PR1 (78)     | 00000000 | 00000000 | 00000000 | 0   | 0
PR2 (7C)     | 00000000 | 00000000 | 00000000 | 0   | 0
PR3 (80)     | 00000000 | 00000000 | 00000000 | 0   | 0
PR4 (84)     | 00000000 | 00000000 | 00000000 | 0   | 0


[!] None of the SPI protected ranges write-protect BIOS region
[!] BIOS should enable all available SMM based write protection mechanisms or configure SPI protected ranges to protect the entire BIOS region


[-] FAILED: BIOS is NOT protected completely 
[+] imported chipsec.modules.common.smm 


[*] Module path: C:\dan\projects\PLab\SystemFundamentals\chipsec-1.1.0\chipsec-1.1.0\source\tool\chipsec\modules\common\smm.py


[x][ Test: Compatible SMM memory (SMRAM) Protection
[*] Compatible SMRAM Control (00:00.0 + 0x88) = 0x1A
[06] D_OPEN = 0 (SMRAM Open)
[05] D_CLS = 0 (SMRAM Closed)
[04] D_LCK = 1 (SMRAM Locked)
[03] G_SMRAME = 1 (SMRAM Enabled)
[02:00] C_BASE_SEG = 2 (SMRAM Base Segment = 010b)


[*] Compatible SMRAM is enabled 
[+] PASSED: Compatible SMRAM is locked down 
[+] imported chipsec.modules.common.smrr 


[*] Module path: C:\dan\projects\PLab\SystemFundamentals\chipsec-1.1.0\chipsec-1.1.0\source\tool\chipsec\modules\common\smrr.py


[x][ Test: CPU SMM Cache Poisoning / SMM Range Registers (SMRR)
[+] OK. SMRR are supported in IA32_MTRRCAP_MSR
[*] Checking SMRR Base programming..
[*] IA32_SMRR_BASE_MSR = 0x00000000B7000006


BASE = 0xB7000000 

MEMTYPE = 6 

[+] SMRR Memtype is WB 

[+] OK so far. SMRR Base is programmed 
[*] Checking SMRR Mask programming.. 


[*] IA32_SMRR_MASK_MSR = 0x00000000FF800800
MASK = 0xFF800000
VLD = 1 


[+] OK so far. SMRR are enabled in SMRR_MASK MSR
[*] Verifying that SMRR_BASE/MASK have the same values on all logical CPUs..
[CPU0] SMRR_BASE = 00000000B7000006, SMRR_MASK = 00000000FF800800
[CPU1] SMRR_BASE = 00000000B7000006, SMRR_MASK = 00000000FF800800


[+] OK so far. SMRR MSRs match on all CPUs 


[+] PASSED: SMRR protection against cache attack seems properly configured


[+] imported chipsec.modules.common.spi_lock 


[*] Module path: C:\dan\projects\PLab\SystemFundamentals\chipsec-1.1.0\chipsec-1.1.0\source\tool\chipsec\modules\common\spi_lock.py


[x][ Test: SPI Flash Controller Configuration Lock
[*] HSFSTS register = 0x0000E008
FLOCKDN = 1
[+] PASSED: SPI Flash Controller configuration is locked


[+] imported chipsec.modules.common.secureboot.keys
[*] Module path: C:\dan\projects\PLab\SystemFundamentals\chipsec-1.1.0\chipsec-1.1.0\source\tool\chipsec\modules\common\secureboot\keys.py
[x][ Test: Protection of Secure Boot Key and Configuraion EFI Variables
[*] SKIPPED: Currently this module can only run on Windows 8 or greater or Linux. Exiting..


[+] imported chipsec.modules.common.secureboot.variables
[*] Module path: C:\dan\projects\PLab\SystemFundamentals\chipsec-1.1.0\chipsec-1.1.0\source\tool\chipsec\modules\common\secureboot\variables.py
[*] SKIPPED: Currently this module can only run on Windows 8 or higher or Linux. Exiting..
[+] imported chipsec.modules.module_template 


[*] Module path: C:\dan\projects\PLab\SystemFundamentals\chipsec-1.1.0\chipsec-1.1.0\source\tool\chipsec\modules\module_template.py


[+] PASSED: Test Passed 


[CHIPSEC] ***************************  SUMMARY  ***************************


[CHIPSEC] Time elapsed 0.394 
[CHIPSEC] Modules total 9 
[CHIPSEC] Modules failed to run 0: 
[CHIPSEC] Modules passed 6: 


[+] PASSED: chipsec.modules.common.bios_kbrd_buffer


[+] PASSED: chipsec.modules.common.bios_ts 
[+] PASSED: chipsec.modules.common.smm 
[+] PASSED: chipsec.modules.common.smrr 


[+] PASSED: chipsec.modules.common.spi_lock 


[+] PASSED: chipsec.modules.module_template 
[CHIPSEC] Modules failed 1: 

[-] FAILED: chipsec.modules.common.bios_wp 
[CHIPSEC] Modules with warnings 0: 
[CHIPSEC] Modules skipped 2: 


[*] SKIPPED: chipsec.modules.common.secureboot.keys
[*] SKIPPED: chipsec.modules.common.secureboot.variables
[CHIPSEC] Version: 1.1.0


C:\dan\projects\PLab\SystemFundamentals\chipsec-1.1.0\chipsec-1.1.0\source\tool>
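
The register values in the log above can be decoded by hand. The following is a minimal Python sketch, not CHIPSEC's own code: the function names are hypothetical, the bit positions are the ones the log itself prints (for example [05] SMM_BWP, [01] BLE, [00] BIOSWE), and the SMRR size calculation assumes the mask is contiguous. The pass/fail rule in `bios_cntl_protects_writes` is a simplified assumption about what the bios_wp check looks for.

```python
# Sketch: decode the SMRR MSRs and the BIOS Control register from the log.
# Function names are illustrative only; they are not part of CHIPSEC.


def decode_smrr(base_msr: int, mask_msr: int) -> dict:
    """Split the SMRR base/mask MSR values into the fields shown in the log."""
    base = base_msr & 0xFFFFF000       # bits 31:12, physical base of SMRAM
    memtype = base_msr & 0xFF          # bits 7:0, memory type (6 = write-back)
    mask = mask_msr & 0xFFFFF000       # bits 31:12, range mask
    valid = (mask_msr >> 11) & 1       # bit 11, VLD
    # Assumes a contiguous mask: the range size is the two's complement of it.
    size = ((~mask) & 0xFFFFFFFF) + 1
    return {"base": base, "memtype": memtype, "mask": mask,
            "valid": valid, "size": size}


def decode_bios_cntl(value: int) -> dict:
    """Extract the BIOS Control bits listed by the bios_wp module."""
    return {
        "BIOSWE": value & 1,           # bit 0, BIOS Write Enable
        "BLE": (value >> 1) & 1,       # bit 1, BIOS Lock Enable
        "TSS": (value >> 4) & 1,       # bit 4, Top Swap Status
        "SMM_BWP": (value >> 5) & 1,   # bit 5, SMM BIOS Write Protection
    }


def bios_cntl_protects_writes(fields: dict) -> bool:
    """Simplified assumption: protected when BLE and SMM_BWP are set
    and BIOSWE is clear."""
    return (fields["BLE"] == 1 and fields["SMM_BWP"] == 1
            and fields["BIOSWE"] == 0)


if __name__ == "__main__":
    smrr = decode_smrr(0x00000000B7000006, 0x00000000FF800800)
    print("BASE    = 0x%08X" % smrr["base"])
    print("MEMTYPE = %d" % smrr["memtype"])
    print("MASK    = 0x%08X" % smrr["mask"])
    print("VLD     = %d" % smrr["valid"])
    print("SIZE    = %d MiB" % (smrr["size"] // (1024 * 1024)))

    cntl = decode_bios_cntl(0x00)
    print("BIOS_CNTL fields:", cntl)
    print("write protected:", bios_cntl_protects_writes(cntl))
```

With the values from this run, the decoded BASE, MEMTYPE, MASK and VLD fields agree with what the smrr module printed, and the all-zero BIOS Control value is consistent with the FAILED result reported by bios_wp.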